REPORT DOCUMENTATION PAGE


1b RESTRICTIVE MARKINGS


2b DECLASSIFICATION / DOWNGRADING SCHEDULE


4 PERFORMING ORGANIZATION REPORT NUMBER(S)


6a  NAME  OF  PERFORMING  ORGANIZATION 

David  M.  Allen,  Ph.D. 


6b OFFICE SYMBOL
(If applicable)


7b ADDRESS (City, State, and ZIP Code)

105  Kinkead  Hall 
Lexington,  Kentucky  40506-0057 


(If  applicable) 


6c  ADDRESS  (City  State,  and  ZIP  Code) 

Department  of  Statistics  -  859  P.O.T. 
University  of  Kentucky 
Lexington,  Kentucky  40506 


8a NAME OF FUNDING / SPONSORING
ORGANIZATION

Office of Naval Research


8c ADDRESS (City, State, and ZIP Code)

800 North Quincy Street
Arlington, Virginia 22217-5000


11 TITLE (Include Security Classification)

Proceedings of the Seventeenth Symposium on the Interface of Computer Science
and Statistics (unclassified)


12 PERSONAL AUTHOR(S)

See attachment for list of authors, affiliations, and titles of papers


9 PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER

N00014-85-G-0157

10 SOURCE OF FUNDING NUMBERS

PROGRAM 

PROJECT 

TASK 

WORK  UNIT 

ELEMENT  NO 

NO 

NO 

ACCESSION  NO 

13a TYPE OF REPORT

As submitted for publ


14 DATE OF REPORT (Year, Month, Day)  15 PAGE COUNT

March 4, 1986: 310


COSATI CODES


GROUP  SUB-GROUP


18 SUBJECT TERMS (Continue on reverse if necessary and identify by block number)

Time  Series,  Nonlinear  Models,  Repeated  measures  Data 
Analysis,  Categorical  Data  Analysis,  Artificial  Intelligence, 
Metadata  of  Computational  Processes,  Statistical  Computer 


19  ABSTRACT  (Continue  on  reverse  if  necessary  and  identify  by  block  number) 

The Seventeenth Symposium on the Interface of Computer Sciences and Statistics
was held in the Radisson Plaza Hotel, Lexington, Kentucky on March 17-19, 1985.
It was hosted by the University of Kentucky. The sessions encompassed a broad
range of topics. A number of sessions dealt with computational methods for
traditional statistical areas. These included Time Series, Nonlinear Models,
Repeated Measures Data Analysis, and Categorical Data Analysis. Some sessions
were in the relatively new areas of Statistics such as Artificial Intelligence,
the Metadata of Computational Processes, Statistical Computing Languages, and
Statistical Workstations. There were also sessions in Numerical Methods, Density
Estimation, Teaching of Statistical Computing, Statistical and Mathematical
Software and Graphics. During one session the entire audience participated in a
round table discussion on the Performance of Statisticians with Statistical
Software. Written versions of nearly all these papers are in this volume.


20  DISTRIBUTION  AVAILABILITY  OF  ABSTRACT  21  ABSTRACT  SECURITY  CLASSIFICATION 

UNCLASSIFIED/UNLIMITED  □ SAME AS RPT  □ DTIC USERS
22a NAME OF RESPONSIBLE INDIVIDUAL  22b TELEPHONE (Include Area Code)  22c OFFICE SYMBOL

Dr. David M. Allen | (606) 257-6901

DD FORM 1473, 84 MAR  83 APR edition may be used until exhausted  SECURITY CLASSIFICATION OF THIS PAGE

All  other  editions  are  obsolete 




Block  18: 


Languages,  Statistical  Workstations,  Numerical  Methods,  Density 
Estimation, Teaching of Statistical Computing, Statistical and




Written versions of nearly all these papers are in this volume. A few
papers  were  not  included  because  of  prior  copyright  elsewhere  or  because 
the  manuscript  was  not  received  from  the  authors. 


This  work  relates  to  Department  of  Navy  Grant  N00014-85-G-0157  issued 
by  the  Office  of  Naval  Research.  The  United  States  Government  has  a 
royalty-free  license  throughout  the  world  in  all  copyrightable  material 
contained  herein. 


Please  include  on  copyright  page, 




DISCLAIMER  NOTICE 


THIS  DOCUMENT  IS  BEST  QUALITY 
PRACTICABLE.  THE  COPY  FURNISHED 
TO DTIC CONTAINED A SIGNIFICANT
NUMBER  OF  PAGES  WHICH  DO  NOT 
REPRODUCE  LEGIBLY. 


Preface 


The Seventeenth Symposium on the Interface of Computer Sciences and Statistics was held in the
Radisson Plaza Hotel, Lexington, Kentucky on March 17-19, 1985. The conference was hosted by the
University of Kentucky. The format for the Symposium was very similar to the preceding symposia in
the series. Dr. John Nash presented the keynote address on Monday morning. This was followed by two
sets of three parallel sessions and workshops. On Tuesday there were three sets of three parallel
sessions.

The  sessions  encompassed  a  broad  range  of  topics.  A  number  of  sessions  dealt  with  computational 
methods  for  traditional  statistical  areas.  These  included  Time  Series,  Nonlinear  Models,  Repeated 
Measures  Data  Analysis,  and  Categorical  Data  Analysis.  Some  sessions  were  in  the  relatively  new  areas 
of  Statistics  such  as  Artificial  Intelligence,  the  Metadata  of  Computational  Processes,  Statistical 
Computing  Languages,  and  Statistical  Workstations.  There  were  also  sessions  in  Numerical  Methods, 
Density Estimation, Teaching of Statistical Computing, Statistical and Mathematical Software and
Graphics. During one session the entire audience participated in a round table discussion on the
Performance of Statisticians with Statistical Software. Written versions of nearly all these papers
are in this volume. A few papers were not included because of prior copyright elsewhere or because
the  manuscript  was  not  received  from  the  authors. 

A large number of people helped make the Seventeenth Symposium a big success. The organizing
committee was Gary Anderson, Kenneth Berk, Thomas J. Boardman, Daniel B. Carr, William F. Eddy, Alan
R. Forsythe, Richard J. Heiberger, Sally E. Howe, Robert E. Kass, William Kennedy, J. Richard Landis,
John Nash, Wesley L. Nicholson, Gordon Sande, Victor Solo and Constance L. Wood.

The  office  staff  of  the  Department  of  Statistics,  particularly  Debra  Artorborn  and  Brian  Moses, 
oversaw  the  correspondence  and  bookkeeping,  maintained  a  participant  data  base,  assembled  registration 
packets,  and  manned  the  registration  desk.  Wimberly  C.  Royster,  Dean  of  the  Graduate  School,  M.  A. 
Baer, Dean of the College of Arts and Sciences, and Joseph M. Gani, Chairman of the Department of
Statistics,  were  all  very  supportive  and  made  many  resources  of  the  University  available  for  the 
Symposium. 

The facilities of the Radisson Plaza Hotel were extremely nice. Thanks are extended to Cindy
Edwards and the rest of the Radisson staff. The Greater Lexington Convention and Visitors Bureau
welcomed participants at the airport, provided literature on things to do and places to eat, and also
helped  with  the  registration. 

The American Statistical Association was helpful in many ways. The efforts of Randall Spoeri
and  Jean  Smith  are  particularly  appreciated.  Financial  support  for  the  Symposium  came  from  the 
Office  of  Naval  Research  and  the  University  of  Kentucky. 
TAKING IT WITH YOU -- PORTABLE STATISTICAL COMPUTING
John C. Nash


Faculty of Administration
University of Ottawa
Ottawa, Ontario, K1N 6N5

Canada


The subject of the presentation is the needed or wanted basis for
portable statistical computing -- the infrastructure statisticians
should have in order to carry out desired statistical computations
wherever they happen to be. Expanding on this theme, we will examine
what this basis implies for statistical software, the data sets we
examine, our own practices and "documentation" in the widest sense, the
computing hardware and software environments useful to support this
activity, and the standards needed to assist us in rendering our work
portable.


INTRODUCTION 

As an active user and promoter of small
computer solutions to both scientific
and general administrative problems,
and as scientific computing editor for
Byte magazine, I am clearly identified
with that proliferating technology
collectively called the "microcomputer
revolution". However, the main
objective of this presentation does NOT
concern microcomputers, except where
the gadgetry illustrates how obstacles
to portability of statistical computing
arise or may be overcome. In wanting
to make our work as free from ties to
geographic locations as possible, I
firmly believe that clear thinking and
a wide perspective are far more
important than brilliance in the design
of a specific piece of hardware or
software.


STATISTICAL COMPUTING -- DEFINITIONS

The basis of statistical computing has,
in my opinion, five facets:

1) methods for data analysis and
statistical interpretation

2) data which is to be the subject of
analysis or computation

3) documentation of what WE -- the
statisticians -- do, that is, of
statistical practice

4) tabulation and display mechanisms,
which are separated from methods to
reflect the necessary involvement of
machinery to effect the desired
outputs

5) the training, education and
research (self-education of the
profession) to improve the overall
technology of statistical computing as
practised.

Here  we  do  not  consider  the  analysis  of 
the  results  of  computations  as  part  of 
the  task  at  hand.  However,  this 
distinction  is  blurred  by  the 
development  of  expert  systems  for 
particular  areas  of  statistics. 

The basis of statistical computing
listed above is in the domain of ideas.
Their realization is the work upon
which many of us labour. We endeavor
first to render the ideas in greater
detail as generalized software --
computer programs, data files and
structures, books, research papers,
presentations, and designs of graphics.
Second, we try to put the ideas into
the "hardware" forms -- disks and
tapes, paper, integrated circuits,
audio/visuals. The juxtaposition of
these software/hardware ideas is
deliberate, in that it focuses
attention on the possibility that there
may be several renderings of an idea in
different "languages" of expression and
different media of recording.


PORTABILITY 


One can think of several routes to make
serious statistical computing portable.
Portable personal computers of
considerable power are now available,
some of which are battery powered and
need no AC power supply. The hardware,
however, needs to be complemented by
suitable software, and our data must be
at hand in a useful form. Neither of
these latter requirements is currently
satisfied, but the availability of the
machinery will entice developments
to appear over the next few years.


To gain access to more powerful
computers and software, and to larger
data sets than may be accommodated on a
portable microcomputer, we may look to
long-distance communication via
terminals. Here the major limitations
are on the flexibility and convenience
of data and command input and of
displays or printed output. Few voice
or data communication facilities have
sufficient capacity for detailed
graphics, either input or output.

Despite their present limitations,
various communications technologies
available now do allow the sharing of
software and data sets, but only if the
program or data files are in some sense
"standard" so that the recipient may
make use of them. To date, standards
for these statistical, as opposed to
computational, constructs are not in
place.

Finally, even when the ideas behind a
particular statistical computation have
been transmitted between practitioners,
we may observe that the results
obtained by the different workers are
not the same. Ultimately, we need a
commonality of approach and methods at
a relatively detailed level. Simply
specifying a method, for example linear
regression, is far from sufficient.

We now examine some of these ideas in
more detail.


DATA


Data has many attributes: format,
medium, content (or lack thereof),
timeliness, volume (of data), history
(author, origin, methods of gathering,
notes and opinions), imputation
methods, sampling design, aggregation
procedures, whether "raw" or "cooked",
security or confidentiality status (I
owe this addition to a conversation
with Gordon Sande). Other workers,
particularly John Tukey, have presented
similar categorization lists. In
transferring data from one set of
workers to another, we must take
account of some or all of the above
attributes. The task of developing a
generalized format to accommodate these
needs is not a trivial one. With Fred
Brown, a research assistant, I have
tried to develop such a format, but do
not yet feel satisfied that it is ready
to publish.


TABULATION AND DISPLAY


The  aspects  of  tabulation  and  display 
which  render  them  useful  as  tools  for 
statistical  analysis  are  the  very 
features  which  are  obstacles  to 
portability.  These  can  be  summarized 
as  form,  style  and  practice.  Form  will 
reflect  the  overall  type  of  design 
followed. Cleveland ([1], [2], [3]) has
made a number of observations on form
which also reflect on style -- how the
particular form is translated to the
object  seen.  The  impact  of  available 
machinery  on  form  and  style  chosen  is 
obvious  if  one  considers  but  one 
example,  the  Chernoff  face.  This 
display  translates  elements  of  a 
multivariate  observation  into  features 
loosely  resembling  a  human  face.  I 
have  personally  found  it  a  useful 
mechanism  for  demonstrating  results  but 
a  rather  poor  exploratory  data  analysis 
tool.  Nevertheless,  if  one  wishes  to 
use  "faces",  then  some  way  of  drawing 
them  must  be  found. 


Traditional approaches (Flury &
Riedwyl, 1981) use plotters of various
types. One can envisage bit-map
displays of modern microcomputers (e.g.
Macintosh) being suitable, but
conventional computer terminals lack
the flexibility to "draw" the necessary
graphs. An alternative approach is to
change the style, and to some extent
the form, of the "face" and use
printer-plot ideas. Turner & Tidmore
(1981) developed a FORTRAN program for
this which was relatively easily
transferred to the Amdahl mainframe at
the University of Ottawa by Mr. P.
Beynon, one of my students. Later Fred
Brown designed a face-drawing program
in BASIC for an Osborne 1, in the
process applying some ideas from
portraiture to improve the "facial"
proportion.


In transporting their analyses,
statisticians are unlikely to be
satisfied with just one of the above
alternatives being available. When
graphical devices are available, the
printer-plot is unlikely to satisfy.
Therefore, a range of software is going
to be needed, all pieces of which
should interface easily to the data and
to the command processor, thereby
allowing the statistician to control
the computations.

As a footnote to this discussion, I
would like to point out that
statistical displays of a relatively
advanced nature are being used outside
the profession. On Monday, March 11,
1985, on page B6 of the Toronto Globe
and Mail (Report on Business) is a
quite nicely executed set of star
displays with an interesting choice of
axes directions and scalings. This
serves to underline the need for
standardization of the practice of
tabulation and display so that readers
moving from one set of displays to
another are not fooled by a simple
change in the conventions.


Methods are the translation of
statistical thought into procedures.

The greatest obstacle here to
portability is the many levels of
choice in transferring the general idea
into a specific and unambiguous
procedure. For instance, in
considering the general method of
regression, 100 years old this year, we
must first decide between the usual
least squares loss function or other
metrics, second (assuming least
squares) whether conventional linear,
ridge or nonlinear approaches should be
used, and third (assuming conventional
linear l.s.) which algorithm to
implement. Even having chosen a
particular algorithm in general, for
example, solution of normal equations,
QR decomposition or singular value
decomposition of the independent
variable matrix (Nash, 1984, p.
166ff), we may have to select an
implementation approach.
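
To make the point about levels of choice concrete, here is a minimal sketch of two of the algorithm choices named above for conventional linear least squares: solution of the normal equations and a QR decomposition. It is an illustration only and is not taken from Nash (1984); the function names, the use of modified Gram-Schmidt for the QR step, and the use of Gaussian elimination with partial pivoting are all assumptions made here, and the singular value decomposition route is omitted for brevity.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def back_substitute(R: Matrix, z: Vector) -> Vector:
    """Solve the upper-triangular system R b = z."""
    p = len(z)
    b = [0.0] * p
    for k in reversed(range(p)):
        tail = sum(R[k][c] * b[c] for c in range(k + 1, p))
        b[k] = (z[k] - tail) / R[k][k]
    return b


def solve_normal_equations(X: Matrix, y: Vector) -> Vector:
    """Least squares via the normal equations (X'X) b = X'y."""
    n, p = len(X), len(X[0])
    # Build the augmented system [X'X | X'y].
    A = [
        [sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
        + [sum(X[i][r] * y[i] for i in range(n))]
        for r in range(p)
    ]
    for k in range(p):
        # Partial pivoting: move the largest remaining entry to the pivot row.
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(k + 1, p):
            f = A[r][k] / A[k][k]
            for c in range(k, p + 1):
                A[r][c] -= f * A[k][c]
    return back_substitute([row[:p] for row in A], [row[p] for row in A])


def solve_qr(X: Matrix, y: Vector) -> Vector:
    """Least squares via X = QR (modified Gram-Schmidt), then R b = Q'y."""
    n, p = len(X), len(X[0])
    Q: Matrix = []  # orthonormal columns, stored as rows of length n
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        v = [X[i][j] for i in range(n)]
        for k in range(j):
            # Remove the component along each earlier orthonormal column.
            R[k][j] = sum(Q[k][i] * v[i] for i in range(n))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(n)]
        R[j][j] = sum(t * t for t in v) ** 0.5
        Q.append([t / R[j][j] for t in v])
    z = [sum(Q[k][i] * y[i] for i in range(n)) for k in range(p)]
    return back_substitute(R, z)
```

On a well-conditioned problem the two routes agree to rounding error; on nearly collinear columns they can differ noticeably, which is one way that two workers who both "did linear regression" may report different numbers. Neither function guards against a rank-deficient matrix.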

So far, we have no executable program
code. Software is the realization of
methods, and once again it is the
diversity of options which hampers the
portability of the statistical
computations. We may choose to
organize our statistical software as
individual programs which stand alone,
as a collection or library of related
programs and/or subroutines, or as an
integrated package not requiring the
user to provide controls or operating
system commands. Clearly the current
trend is toward packages, even though
this may make it more difficult to
perform particular computations in
particular computing environments. The
usual form in which packages are
distributed is as an ensemble of code
executable on a particular computer
configuration, since it runs against
the producers' interests to have users
transport (steal?) the code to other
machines. Libraries are usually
available only in machine (object) code
form, while the individual programs of
statistical software may be found as
source code.

Source code must be expressed in some
programming language, and most object
code reflects some of the constraints
implicit in all programming languages.
The languages themselves echo features
of the hardware which is available --
floating-point arithmetic, graphical
devices, memory management. At the
hardware level, we note that there are
many established international,
national or institutional standards
which have been agreed and adopted. (I
specifically exclude the so-called
"industry standards" created by
advertising copy writers.) Programming
language standards are gradually having
an influence on the software being
written, but to my knowledge there are
no standards yet being considered for
the design and expression of program
packages. For the user to be able to
begin using one package after
experience with another, some
reasonably simple guidelines are
clearly needed for the user interface,
for the meaning of commonly used words,
and for accessing data, devices, or
other computing resources.

As statisticians we should be more
aggressive in supporting existing
standards, even as we begin the search
for new ones to cover our particular
area of work. Our lack of awareness of
programming standards is illustrated by
code published by Frank (1981) in the
Journal of the American Statistical
Association. In a program barely one
page in length, practically each line
has some construct or other which is
non-standard, a typographical error, or
a stylistic fault. If the purpose in
publishing this code is to allow its
use by other statisticians, then the
editors, even more than the author,
have missed the target!


HANDLING  CHOICE 


To render our computations portable to
other computing environments and
practitioners, I suggest four main
routes:

1) Documentation of sufficient quality
is needed so that all relevant details
of the implementation of a method or
the characteristics of a data set or
approach to an analysis are clearly
discernible. Special features -- the
exceptions to the rules -- need to be
noted.

2) Statisticians need to agree, either
formally or informally, on the
procedures and ideas of standard
algorithms and practices. While the
effort to formalize agreement may
appear to be enormous, there is a
growing body of work which is carried
out by specific methods attributed to
workers by name, for example,
Marquardt's method for nonlinear least
squares parameter estimation. Such
methods can be written down clearly
(Nash, 1979) in step-and-description
form, and modifications can be noted in
suitable documentation. However, the
will is needed to perform activities
seemingly peripheral to statistics.

3) For most statistical analysis the
computations may be considered
conventional. To avoid disagreements
over the results, standard computer
programs and data handling procedures
are needed. Again, the effort to
obtain formal agreement may not be
required, since many statisticians are
using a relatively small set of
packages such as Minitab, SAS, SPSS or
BMDP. There is a considerable interest
in the development of test problems
(see the workshop session "Measuring
the performance of statisticians with
statistical software" of these
proceedings) and it is likely the
producers of packages will align their
major programs to produce similar
results in order to avoid criticism and
consequent marketing headaches. Once
again, variations on a theme need to be
documented. Moreover, the existence of
a standard method should not prevent
researchers from attempting different
approaches.

4) Mechanisms need to be established
for resolving real or apparent
inconsistencies in results.
Statisticians are in the forefront in
this regard, since our journals have
adopted a practice of presenting papers
followed by discussions. This presents
one avenue for airing differences of
opinion. For discussions at a more
detailed level, workers may want to
consider establishing electronic mail
conferences, moderated by knowledgeable
researchers who can focus discussion.

DOCUMENTATION 

My firm opinion is that good
documentation is the core of advances
in portability, and should mention the
following:

- the data or type of data which can
be/was analyzed

- the methods, algorithms, software
used

- the time/date when each entry in the
documentation was made

- all edits (of data / methods /
documentation)

- observations / comments / hunches

- the name(s) of persons adding to or
changing documentation.
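
The items in the list above map naturally onto a small append-only log. The sketch below is one hypothetical way to hold them; the class names `AnalysisLog` and `LogEntry`, the field names, and the fixed set of entry kinds are all assumptions introduced for illustration, not a format proposed in the paper.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

# Hypothetical entry kinds, loosely following the list above.
KINDS = ("data", "method", "edit", "comment")


@dataclass(frozen=True)
class LogEntry:
    author: str  # name of the person adding to or changing the documentation
    kind: str    # one of KINDS
    note: str    # what was analyzed, used, edited, or observed
    # Time/date the entry was made, recorded automatically in UTC.
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class AnalysisLog:
    data_description: str  # the data or type of data analyzed
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, author: str, kind: str, note: str) -> LogEntry:
        """Append a timestamped entry; existing entries are never altered."""
        if kind not in KINDS:
            raise ValueError(f"unknown entry kind: {kind!r}")
        entry = LogEntry(author=author, kind=kind, note=note)
        self.entries.append(entry)
        return entry

    def contributors(self) -> List[str]:
        """Names of everyone who has added to the documentation."""
        return sorted({e.author for e in self.entries})
```

Keeping entries immutable and only ever appending means that edits to data or methods are themselves documented rather than silently overwritten, which is the property the list asks for.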

TRAINING,  EDUCATION  AND  RESEARCH 

Portability of statistical computing
concerns the transfer of ideas, which
at present is plagued by our academic
traditions. These have led to delays
in publication because of the financial
pressures on journals and the slowness
of refereeing and review. Worse, since
academic workers' career development
depends in part on journal articles,
there is little credit for
non-traditional forms of idea transfer
-- computer conferencing, software
development, computer aided instruction
development. It is also clear that
use is going to be made of statistical
computation by those who have had no
part in developing the tools -- new
statisticians, professionals in other
disciplines, and the general public.

The last group is an increasing "user"
in developing business or public
policy, where it is important to argue
the consequences of decisions rather
than the validity of the data or
methods. Consequently, impatience with
results which cannot be repeated is to
be expected, and the codification and
standardization of statistical practice
can have a large payoff.

A by-product of such codification is
that it permits expert systems, either
tactical (for specific types of
computations) or strategic (to
recommend global approaches to data
analysis), to be developed.




REALIZATION  OF  PORTABILITY 

The discussion above has a possible
concrete realization which can be begun
immediately. The technical
requirements to allow statistical data
and software to be transferred from
location to location via communications
technologies can be met, even if not
with great ease. At a minimum, these
requirements are

1) file formats for programs and data,
which I would currently recommend be
simple text files (code may have to be
transferred as hexadecimal digits),

2) file transfer mechanisms, such as
electronic mail with suitable file
server(s). Byte magazine already
allows users to download programs which
have appeared in the magazine, but
access is at the moment via
long-distance voice lines, which are
much more expensive than the
packet-switched data networks.

3) standards for data and programs.
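
As an illustration of the first requirement, the remark that code may have to be transferred as hexadecimal digits can be sketched as follows. This is a minimal example under stated assumptions: the function names and the 64-character line width are choices made here, not part of any format described in the paper. It relies only on Python's built-in `bytes.hex` and `bytes.fromhex`.

```python
def to_hex_text(payload: bytes, width: int = 64) -> str:
    """Encode arbitrary bytes as lines of hexadecimal digits.

    The result contains only the characters 0-9 and a-f plus newlines,
    so it can travel through channels that accept plain text only.
    """
    digits = payload.hex()
    lines = [digits[i:i + width] for i in range(0, len(digits), width)]
    return "\n".join(lines)


def from_hex_text(text: str) -> bytes:
    """Recover the original bytes, ignoring any whitespace or line breaks."""
    return bytes.fromhex("".join(text.split()))
```

The encoding doubles the size of the file, which is the price of restricting the transfer to a character set that any text channel will pass unchanged.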


While not yet established, one can
imagine a relatively simple, limited
standard for small to medium sized data
sets and for the expression of programs
in source code in one or more
programming languages for a restricted
class of target machines.

The technical requirements, as
delineated above, will not be
translated into a reality without
investments. First, entrepreneurs will
need to foresee sufficient rewards to
justify the expenditure for a
"head-end" file store to maintain the
base of data and software with
attendant telecommunications hardware
and software to allow easy access for
(possibly) naive users. The hardware
for telecommunications at the present
time should probably link to one or
more of the public packet switched
networks rather than the usual
voice-line telephone. Software must
handle both the database and the
user interface. Simple but effective
charging algorithms are needed so that
revenues can be recorded and collected
without undue difficulty for
subscribers.

The development of standards requires
investments of time and money on the
part of those involved in statistical
computing. Except in the quality
control area, statisticians have not
yet participated (as statisticians) in
these types of activities.

The third "investment" needed is in the
development of the intellectual
property to be transferred and shared
among statisticians. Developers will
have to receive academic credit for
such work, or it will have to be
remunerated in the marketplace. The
latter remuneration requires royalties
to be paid, suitable cooperative
enforcement of ownership of the
intellectual property, and attractive
pricing and service by the vendors to
encourage users to obtain the material
from the authorized source. Indeed,
software vendors such as Borland
International have demonstrated that a
good product at an attractive price
will not be "stolen" to an appreciable
extent.


PROGNOSIS

The above recipe for permitting
portability of statistical computing
via a central database of data,
programs and documentation is feasible
to try now. I believe that the time is
ripe to begin some experiments in
restricted areas of statistical
computation to discover the details of
design which will facilitate further
progress. Standards for computer
programs for statistical computations
are overdue, particularly for those
which are published in journals. In
order to move from the domain of
research to generally available
reality, analyses of the risks and
benefits of commercial investment will
need to be prepared, and consortia
formed to market (partial)
implementations of such systems. This
last point represents the end-goal of
the ideas presented here, and believing
that the concepts presented are
feasible to carry out, I have started
to seek business alliances to realize
them. However, I hope that those in
the audience who do not accept the
total parcel presented will still find
valuable points within the discussion.
Finally, while I have focussed on
moving ideas rather than people and
machinery, it should be kept in mind
that there are often reasons why it is
necessary to travel and transport in
order to take our statistical
computations with us.
ON BOOTSTRAP ESTIMATES OF FORECAST MEAN SQUARE ERRORS FOR AUTOREGRESSIVE PROCESSES

This paper presents several analyses which suggest that the bootstrap procedure used
by Freedman and Peters to simulate errors in forecasting future values of an
econometrically modelled process is of limited usefulness for estimating mean square
forecast errors.


1.  INTRODUCTION 

Freedman and Peters (1984) recently applied
a resampling procedure (the "bootstrap") to
obtain estimates of mean square error for the
forecasts from an autoregression with
exogenous terms. In this paper, we start with a
theoretical analysis of their suggested
procedure for the case of (not necessarily
stationary) autoregressive models without
exogenous terms and later describe two
situations in which the same conclusions hold in
the presence of exogenous variables.


The theoretical mean square forecast error from an estimated model is the sum of
two components, the mean square forecast error of the optimal predictor and the
mean square difference between the optimal forecast and the estimated model's
forecast. This latter component is of order 1/T, where T is the length of the
observed series, and so is negligible with large samples. Our theoretical
analysis in Section 2 shows that the bootstrap estimate of mean square forecast
error is the sum of the usual (naive) large-sample estimate of the first
component, easily obtainable without the bootstrap, and a small-sample estimate
of the second. A Gaussian Monte Carlo value of the second component is obtained
in Section 3 for series of length 25 from the AR(2) models used in the study of
Ansley and Newbold, along with the value of the root mean square error (rmse) of
the large-sample estimator of the m-step-ahead forecast error, for m = 1, 2
and 5. In these examples, the rmse is always substantially larger than the
O(1/T) component, supporting the observation of Stine (1982) that estimates of
the second component are of little use in estimating mean square forecast error
unless better estimators of the first component are available.

In the final section, we discuss conditional forecast mean square errors
associated with predictions of the future of the observed sample path, and
conclude that in this context as well, the bootstrap's potential contribution
seems limited.


2. BOOTSTRAP ESTIMATES OF UNCONDITIONAL MEAN SQUARE FORECAST ERROR

The simple bootstrap procedure of Freedman and Peters described below would
appear to be appropriate when observations $y_1, \ldots, y_T$ are available from
a time series obeying a general p-th order autoregression (p < T) of the form

(2.1)  $y_t = \delta + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + e_t \quad (t \ge p+1)$,


where the $e_t$ $(t \ge p+1)$ are independent, identically distributed random
variables with mean 0 and variance $\sigma^2$ which are independent of earlier
y's; that is, for $k > 0$, $e_t$ and $y_{t-k}$ are independent. It is assumed
that the order p is known and, only for simplicity of notation, that all of the
parameters $\delta, \phi_1, \ldots, \phi_p$ and $\sigma^2$ are unknown. Define
$\theta = (\delta, \phi_1, \ldots, \phi_p)'$. For any $m > 0$ we can use back
substitution in (2.1) to obtain

(2.2)  $y_{T+m} = \sum_{j=0}^{m-1} \psi_j e_{T+m-j} + f_m[\theta](y_T, \ldots, y_{T-p+1})$,

where the coefficients $\psi_0\ (= 1), \psi_1, \psi_2, \ldots$ satisfy

(2.3)  $\sum_{k=0}^{p} \phi_k \psi_{j-k} = 0 \quad (\phi_0 = -1)$,

and where $f_m[\theta](y_T, \ldots, y_{T-p+1})$ is linear in
$y_T, \ldots, y_{T-p+1}$. For example, if $p = 1$, then $\psi_j = \phi_1^j$ and
$f_m[(\delta, \phi_1)](y_T) = \delta(1 + \phi_1 + \cdots + \phi_1^{m-1}) + \phi_1^m y_T$.
The two expressions on the
right hand side of (2.2) are stochastically independent since the e's are
independent of earlier y's. It follows from this that
$f_m[\theta](y_T, \ldots, y_{T-p+1})$ describes the optimal forecast (the
conditional mean of $y_{T+m}$ given $y_1, \ldots, y_T$) and that
$\sum_{j=0}^{m-1} \psi_j e_{T+m-j}$ is the resulting forecast error.

This optimal forecast cannot be precisely determined because $\theta$ is
unknown. If $\hat\theta = (\hat\delta, \hat\phi_1, \ldots, \hat\phi_p)'$ is any
estimate of $\theta$ obtained from $y_1, \ldots, y_T$, then
$f_m[\hat\theta](y_T, \ldots, y_{T-p+1})$ is a forecast of $y_{T+m}$, with
forecast error

(2.4)  $y_{T+m} - f_m[\hat\theta](y_T, \ldots, y_{T-p+1}) = \sum_{j=0}^{m-1} \psi_j e_{T+m-j} + \{f_m[\theta](y_T, \ldots, y_{T-p+1}) - f_m[\hat\theta](y_T, \ldots, y_{T-p+1})\}$.

Since the $e_{T+m-j}$, $j = 0, \ldots, m-1$, are independent of
$y_1, \ldots, y_T$, the two terms on the right hand side of (2.4) are
independent. Consequently, using E to denote expectation, the mean square
m-step-ahead forecast error when the forecast is given by
$f_m[\hat\theta](y_T, \ldots, y_{T-p+1})$ satisfies

(2.5)  $E\{y_{T+m} - f_m[\hat\theta](y_T, \ldots, y_{T-p+1})\}^2 = \sigma^2 \sum_{j=0}^{m-1} \psi_j^2 + E\{f_m[\theta](y_T, \ldots, y_{T-p+1}) - f_m[\hat\theta](y_T, \ldots, y_{T-p+1})\}^2$.

If T is large, and $\hat\theta$ is a consistent estimator of $\theta$ (e.g. from
least squares, if $E|e_t|^\alpha < \infty$ for some $\alpha > 2$, see Lai and
Wei (1983)), then the second term on the right in (2.5) can be ignored and the
mean square forecast error can be adequately approximated by

(2.6)  $\hat\sigma^2(T-p) \sum_{j=0}^{m-1} \hat\psi_j^2$,

where the $\hat\psi$'s are obtained by using the $\hat\phi$'s in (2.3), and
$\hat\sigma^2(T-p)$ is given by

(2.7)  $\hat\sigma^2(T-p) = (T-p)^{-1} \sum_{t=p+1}^{T} (y_t - \hat\delta - \hat\phi_1 y_{t-1} - \cdots - \hat\phi_p y_{t-p})^2$.

If T is small, however, then the second term on the right in (2.5) need not be
negligible. Also, the quantity (2.6) may be an inadequate approximation to
$\sigma^2 \sum_{j=0}^{m-1} \psi_j^2$. For the situation in which T is small,
Freedman and Peters (1983) propose the following bootstrap procedure. Define

$\hat e_t = y_t - \hat\delta - \hat\phi_1 y_{t-1} - \cdots - \hat\phi_p y_{t-p}, \quad t = p+1, \ldots, T$.
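
The large-sample approximation (2.6) needs only the fitted coefficients and the residuals just defined. The following is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, the series is a plain Python list holding $y_1, \ldots, y_T$, and the coefficient estimates are taken as given rather than computed.

```python
def psi_weights(phi, m):
    """Return psi_0, ..., psi_{m-1} from the recursion (2.3).

    phi is the list [phi_1, ..., phi_p]; psi_0 = 1 and, for j >= 1,
    psi_j = sum_{k=1}^{min(j, p)} phi_k * psi_{j-k}.
    """
    psi = [1.0]
    for j in range(1, m):
        psi.append(sum(phi[k - 1] * psi[j - k]
                       for k in range(1, min(j, len(phi)) + 1)))
    return psi


def residual_variance(y, delta, phi):
    """sigma-hat^2(T - p) as in (2.7): mean squared one-step residual."""
    p, T = len(phi), len(y)
    resid = [y[t] - delta - sum(phi[k] * y[t - 1 - k] for k in range(p))
             for t in range(p, T)]
    return sum(e * e for e in resid) / (T - p)


def naive_forecast_mse(y, delta, phi, m):
    """Large-sample approximation (2.6): sigma-hat^2 times sum of psi-hat_j^2."""
    psi = psi_weights(phi, m)
    return residual_variance(y, delta, phi) * sum(w * w for w in psi)
```

For p = 1 the weights reduce to powers of the coefficient, in agreement with the example following (2.3).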


Since we are concerned with the situation in which only one realization of the
series $y_t$ is observed, we will now regard the y's and $\hat\theta$ as fixed.
We will assume that the sample mean $\bar e$ of the $\hat e$'s is 0, as happens,
for example, when $\hat\theta$ is chosen to minimize $\hat\sigma^2(T-p)$ in
(2.7). (Otherwise, use $\hat e_t - \bar e$ in place of $\hat e_t$ below.) Then
if we define $e^*_t$, $t > p$, by successive independent draws with replacement
from $\{\hat e_{p+1}, \ldots, \hat e_T\}$, we obtain a series of identically
distributed random variables with mean 0 and variance $\hat\sigma^2(T-p)$ whose
common distribution is the empirical distribution of
$\{\hat e_{p+1}, \ldots, \hat e_T\}$. Now we define the so-called pseudo-data
series, $y^*_t$, by means of $y^*_t = y_t$, $1 \le t \le p$, and

(2.8)  $y^*_t = \hat\delta + \hat\phi_1 y^*_{t-1} + \cdots + \hat\phi_p y^*_{t-p} + e^*_t \quad (t > p)$.

The $e^*$'s are independent of earlier $y^*$'s. Let $\hat\theta^*$ denote the
value corresponding to $\hat\theta$ when $y^*_1, \ldots, y^*_T$ are used in
place of the original values $y_1, \ldots, y_T$. For example, if $\hat\theta$
was obtained by least squares, we choose $\hat\theta^*$ so that

$\sum_{t=p+1}^{T} (y^*_t - \delta - \phi_1 y^*_{t-1} - \cdots - \phi_p y^*_{t-p})^2$

is minimized.


We have now created an analogue of the original situation, but one in which we
can use a (pseudo-)random number generator to simulate draws with replacement
from $\{\hat e_{p+1}, \ldots, \hat e_T\}$ and so obtain as many
(pseudo-)independent realizations of $y^*_1, \ldots, y^*_{T+m}$ as we like. With
these realizations, finally, we can approximate the distribution of the forecast
error process

$y^*_{T+m} - f_m[\hat\theta^*](y^*_T, \ldots, y^*_{T-p+1})$

to any desired degree of accuracy. To the extent that this resembles the
distribution of $y_{T+m} - f_m[\hat\theta](y_T, \ldots, y_{T-p+1})$, we thereby
gain information about the error process in which we are actually interested.

For example, following Freedman and Peters (1983), given realizations
$y^{*(n)}_1, \ldots, y^{*(n)}_{T+m}$, $n = 1, \ldots, N$, we can approximate

(2.9)  $E^*\{y^*_{T+m} - f_m[\hat\theta^*](y^*_T, \ldots, y^*_{T-p+1})\}^2$

by means of

$N^{-1} \sum_{n=1}^{N} \{y^{*(n)}_{T+m} - f_m[\hat\theta^{*(n)}](y^{*(n)}_T, \ldots, y^{*(n)}_{T-p+1})\}^2$.

(In (2.9) and below, we use $E^*$ to denote expectation with respect to the
distribution of the series $e^*_t$.)

The question is, what is the relationship between the quantity (2.9) and
$E\{y_{T+m} - f_m[\hat\theta](y_T, \ldots, y_{T-p+1})\}^2$? To obtain a partial
answer, we note that, by analogy with (2.5), the quantity (2.9) is equal to

(2.10)  $\hat\sigma^2(T-p) \sum_{j=0}^{m-1} \hat\psi_j^2 + E^*\{f_m[\hat\theta](y^*_T, \ldots, y^*_{T-p+1}) - f_m[\hat\theta^*](y^*_T, \ldots, y^*_{T-p+1})\}^2$.

Thus, this bootstrap procedure inflates the naive estimate of mean square
prediction error, (2.6), by an amount

(2.11)  $E^*\{f_m[\hat\theta](y^*_T, \ldots, y^*_{T-p+1}) - f_m[\hat\theta^*](y^*_T, \ldots, y^*_{T-p+1})\}^2$,

which is clearly a proxy for the mean square deviation of
$f_m[\hat\theta](y_T, \ldots, y_{T-p+1})$ from
$f_m[\theta](y_T, \ldots, y_{T-p+1})$,

(2.12)  $E\{f_m[\theta](y_T, \ldots, y_{T-p+1}) - f_m[\hat\theta](y_T, \ldots, y_{T-p+1})\}^2$,

appearing as the second component on the right hand side of (2.5). Since the
quantity (2.6) is known independently of the bootstrap procedure, we conclude
that an estimate of (2.11) is, in fact, the only contribution made by this
procedure. Further, to estimate (2.11) it is clear that pseudo-future data
$y^*_{T+1}, \ldots, y^*_{T+m}$ are not required, but only realizations of
$y^*_1, \ldots, y^*_T$. Thus, in place of Freedman and Peters' procedure to
estimate the mean square m-step-ahead forecast error, it seems appropriate to
consider only quantities

(2.13)  $N^{-1} \sum_{n=1}^{N} \{f_m[\hat\theta](y^{*(n)}_T, \ldots, y^{*(n)}_{T-p+1}) - f_m[\hat\theta^{*(n)}](y^{*(n)}_T, \ldots, y^{*(n)}_{T-p+1})\}^2$,

using these to estimate (2.12), the component of mean square forecast error due
to the use of $\hat\theta$ instead of $\theta$ in the forecast function.
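
The quantity (2.13) can be sketched in a few lines for the first-order case with intercept, where the forecast function has the closed form given after (2.3). This is an illustrative sketch only: the function names, the default number of replications and the use of Python's `random` module are assumptions, and nothing here reproduces the computations of Freedman and Peters.

```python
import random


def fit_ar1(y):
    """Least squares fit of y_t = delta + phi * y_{t-1} + e_t."""
    x, z = y[:-1], y[1:]
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxz = sum((a - xbar) * (b - zbar) for a, b in zip(x, z))
    phi = sxz / sxx
    return zbar - phi * xbar, phi


def forecast(delta, phi, y_last, m):
    """f_m[(delta, phi)](y_T) = delta*(1 + phi + ... + phi^(m-1)) + phi^m * y_T."""
    return delta * sum(phi ** j for j in range(m)) + phi ** m * y_last


def bootstrap_estimation_term(y, m, n_boot=500, seed=0):
    """Monte Carlo version of (2.13) for an AR(1) with intercept."""
    rng = random.Random(seed)
    delta, phi = fit_ar1(y)
    # Least squares residuals with an intercept have sample mean 0 (up to rounding).
    resid = [y[t] - delta - phi * y[t - 1] for t in range(1, len(y))]
    total = 0.0
    for _rep in range(n_boot):
        path = [y[0]]                                   # y*_1 = y_1
        for _t in range(1, len(y)):
            path.append(delta + phi * path[-1] + rng.choice(resid))   # (2.8)
        d_star, p_star = fit_ar1(path)                  # theta-hat* from pseudo-data
        diff = (forecast(delta, phi, path[-1], m)
                - forecast(d_star, p_star, path[-1], m))
        total += diff * diff
    return total / n_boot
```

Note that no pseudo-data beyond time T are generated, in line with the remark preceding (2.13).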


Somewhat analogous observations can be made for the model selection procedure
proposed in Freedman and Peters (1983): Suppose two different autoregressive
models, of orders p(A) and p(B), are fit to the observed data
$y_1, \ldots, y_T$, resulting in estimated parameters $\hat\theta_A$ and
$\hat\theta_B$, residual populations
$\{\hat e^A_{p(A)+1}, \ldots, \hat e^A_T\}$ and
$\{\hat e^B_{p(B)+1}, \ldots, \hat e^B_T\}$, and pseudo-data series $y^{A*}_t$
and $y^{B*}_t$ as above. Freedman and Peters suggest that each model be fit to,
and then used to forecast, the pseudo-data from the other model, and that
bootstrap estimates of the mean square forecast error be calculated. The model
having the smaller estimated mean square forecast error is to be preferred.
Thus, using an obvious notational scheme, the idealized quantities to be
compared are

$E^{A*}\{y^{A*}_{T+m} - f^B_m[\hat\theta^{A*}_B](y^{A*}_T, \ldots, y^{A*}_{T-p(B)+1})\}^2$

and

$E^{B*}\{y^{B*}_{T+m} - f^A_m[\hat\theta^{B*}_A](y^{B*}_T, \ldots, y^{B*}_{T-p(A)+1})\}^2$.

By the argument used to derive (2.5), these idealized quantities are equal,
respectively, to

(2.14)  $\hat\sigma^2_A(T-p(A)) \sum_{j=0}^{m-1} (\hat\psi^A_j)^2 + E^{A*}\{f^A_m[\hat\theta_A](y^{A*}_T, \ldots, y^{A*}_{T-p(A)+1}) - f^B_m[\hat\theta^{A*}_B](y^{A*}_T, \ldots, y^{A*}_{T-p(B)+1})\}^2$

and

(2.15)  $\hat\sigma^2_B(T-p(B)) \sum_{j=0}^{m-1} (\hat\psi^B_j)^2 + E^{B*}\{f^B_m[\hat\theta_B](y^{B*}_T, \ldots, y^{B*}_{T-p(B)+1}) - f^A_m[\hat\theta^{B*}_A](y^{B*}_T, \ldots, y^{B*}_{T-p(A)+1})\}^2$.


Since the leading expressions in (2.14) and (2.15) can be calculated
independently of the bootstrap, we see, as before, that the bootstrap's only
contribution is to compare forecasts and that pseudo-data at times later than T
are not needed for this.

All of the arguments given above also apply to the case of vector
autoregressions, and thus also to the case of autoregressions with exogenous
variables, provided that endogenous and exogenous variables are simultaneously
forecasted from a combined vector autoregression. They also apply if all needed
values of the exogenous variables are assumed to be nonrandom and known, as in
Freedman and Peters (1984).


3. THE SIZE OF (2.12) IN SOME EXAMPLES

Again using an obvious notation, let us rewrite (2.5) as

(3.1)  $\sigma^2_{m,T} = \sigma^2_m + E\lambda^2_{m,T}$.

The analogous formula for the bootstrap estimate (see (2.10)) can be written

(3.2)  $\hat\sigma^{*2}_{m,T} = \hat\sigma^2_m + E^*\lambda^{*2}_{m,T}$.

For estimating $\sigma^2_{m,T}$, the practical significance of having an
estimate $E^*\lambda^{*2}_{m,T}$ of $E\lambda^2_{m,T}$ depends upon the size of
$E\lambda^2_{m,T}$ relative to $\sigma^2$ and to the root mean square estimation
error of the large-sample estimate $\hat\sigma^2(T-p)$ of $\sigma^2$,

$\mathrm{rmse}(\hat\sigma^2(T-p)) = \{E(\hat\sigma^2(T-p) - \sigma^2)^2\}^{1/2}$.

In Table 3.1 below, we present Monte Carlo estimates of the ratios
$E\lambda^2_{m,T}/\sigma^2$ and

(3.3)  $E\lambda^2_{m,T} / \mathrm{rmse}(\hat\sigma^2(T-p))$

for the observation length T = 25 for some Gaussian AR(2) processes

(3.4)  $y_t = \delta + \phi_1 y_{t-1} + \phi_2 y_{t-2} + e_t$

utilized in the study of Ansley and Newbold (1981). We note that these
quantities are relevant for the estimation of $\sigma_{m,T}$ as well, since, for
example,

$\sigma_{m,T}/\sigma_m = (1 + E\lambda^2_{m,T}/\sigma^2_m)^{1/2}$,

which is well approximated by

$1 + E\lambda^2_{m,T}/(2\sigma^2_m)$

if $(E\lambda^2_{m,T}/\sigma^2_m)^2/8$ is negligible (Taylor polynomial
approximation). For each pair of coefficients $\phi_1, \phi_2$ in the Table, we
estimated the quantities $E\lambda^2_{m,T}$ and
$\mathrm{rmse}(\hat\sigma^2(T-p))$ as the mean of sample estimates obtained from
1000 stationary pseudo-Gaussian series satisfying (3.4) with $\delta = 0$, using
least squares to estimate $\delta$, $\phi_1$ and $\phi_2$. (The IMSL
pseudo-Gaussian generator GGNML was utilized.) The tabled results suggest that
estimation of $E\lambda^2_{m,T}$ is of little consequence when
$\hat\sigma^2(T-p)$ is used to estimate $\sigma^2$.


Table 3.1  Values of $E\lambda^2_{m,T}/\sigma^2$ and (3.3) for m = 1, 2 and 5,
for selected Gaussian AR(2) processes, with T = 25.

  $\phi_1$   $\phi_2$    m    $E\lambda^2_{m,T}/\sigma^2$    (3.3)

    .40       -.15       1             .01                   .02
                         2             .01                   .01
                         5             .00                   .01

    .80       -.65       1             .01                   .05
                         2             .04                   .04
                         5             .02                   .02

    .80       -.16       1             .03                   .04
                         2             .02                   .03
                         5             .02                   .04

We have not included results for those of Ansley and Newbold's AR(2) models
whose characteristic polynomials have a root in the annulus 1.0 < |z| < 1.24.
With T = 25, simulations for such models produced large numbers of explosive
series (the estimated characteristic polynomials had a root in |z| < 1.0).


4. CONDITIONAL MEAN SQUARE FORECAST ERROR

In the preceding sections, we investigated unconditional mean square forecast
error. However, it is the error associated with predicting a future point on the
observed sample path (realization) which usually is of most interest.


4A. Mean Square Error Formulas

Since, by (2.1), the value of $y_{T+m}$ depends on the data $y_1, \ldots, y_T$
only through the last p observations, it is easy to check that we can simply
reinterpret the expectation operator E in (2.5) as designating expectation
conditional upon $y_T, y_{T-1}, \ldots, y_{T-p+1}$ and thereby obtain the
fundamental decomposition of the mean square forecast error conditional upon the
observed sample path. The $y_T, y_{T-1}, \ldots, y_{T-p+1}$ in the second term
on the right in (2.5) are now held constant, with the result that this second
term simplifies into a linear expression in the higher order moments of
$\hat\theta - \theta$. The mean-zero first order case is illustrative: if

$y_t = \phi y_{t-1} + e_t \quad (\phi \ne 0)$    (4.1)

with $e_t$, $t \ge 1$, i.i.d. having mean 0 and variance $\sigma^2$, and with
$e_t$ independent of $y_{t-k}$ whenever $k > 0$, then
$f_m[\phi](y_T) = \phi^m y_T$. From the Taylor polynomial expansion of
$f_m[\hat\phi](y_T)$ about $\phi$ we have

$f_m[\hat\phi](y_T) - f_m[\phi](y_T) = y_T \sum_{j=1}^{m} \binom{m}{j} \phi^{m-j} (\hat\phi - \phi)^j$,    (4.2)

where $\binom{m}{j} = m(m-1)\cdots(m-j+1)/j!$. Taking the mean square of (4.2)
conditional on $y_T$, we obtain

$E\{f_m[\hat\phi](y_T) - f_m[\phi](y_T)\}^2 = y_T^2 \sum_{j=1}^{m} \sum_{k=1}^{m} \binom{m}{j}\binom{m}{k} \phi^{2m-j-k} E(\hat\phi - \phi)^{j+k}$.    (4.3)

To estimate (4.3) via the bootstrap, we replace $y^*_T$ in (2.11) by $y_T$
(ideally generating the pseudo-data in such a way that $y^*_T = y_T$, but see 4B
below). By analogy with (4.3), we then have

$E^*\{f_m[\hat\phi^*](y_T) - f_m[\hat\phi](y_T)\}^2 = y_T^2 \sum_{j=1}^{m} \sum_{k=1}^{m} \binom{m}{j}\binom{m}{k} \hat\phi^{2m-j-k} E^*(\hat\phi^* - \hat\phi)^{j+k}$.    (4.4)

The efficacy of the bootstrap procedure is usually related to the extent to
which the distribution of $\hat\theta^* - \hat\theta$ resembles that of
$\hat\theta - \theta$ and to how insensitive this latter distribution is to the
true parameter value $\theta$. However, for our problem, the situation
illustrated by (4.3) and (4.4) obviously holds generally: the expected mean
square of
$f_m[\hat\theta](y_T, \ldots, y_{T-p+1}) - f_m[\theta](y_T, \ldots, y_{T-p+1})$
conditional on $y_T, \ldots, y_{T-p+1}$ depends on the true value of $\theta$ as
well as on the distribution of $\hat\theta - \theta$, suggesting that the
quality of the bootstrap approximation will be influenced by the accuracy of
$\hat\theta$ as an estimate of $\theta$.


4B. Bootstrapping Conditional Sample Paths

It would seem like an attractive idea, when, as in this section, statistics
associated with the distribution of $y_{T+m}$ conditional on
$y_T, \ldots, y_{T-p+1}$ are being approximated, to generate pseudo-data
$y^*_t$ for the bootstrap in such a way that $y^*_t = y_t$ holds for
$T-p+1 \le t \le T$. For example, it would be appealing to compute the estimates
$\hat\phi^*$ for (4.1) from sample paths passing through $y_T$.

To illustrate a first approach to accomplishing this, suppose we have
bootstrapped residuals $e^*_{p+1}, \ldots, e^*_T$ from an estimate $\hat\phi$ of
$\phi$ in (4.1). To generate $y^*_t$ satisfying

$y^*_t = \hat\phi y^*_{t-1} + e^*_t, \quad 2 \le t \le T$,

with $y^*_T = y_T$, we could obviously set $y^*_T = y_T$ and recursively define

$y^*_t = \hat\phi^{-1} y^*_{t+1} - \hat\phi^{-1} e^*_{t+1}, \quad 1 \le t \le T-1$.    (4.5)

In this case, however, $y^*_t$ is neither independent of nor even uncorrelated
with $e^*_{t+1}$ for $1 \le t \le T-1$. Thus the bootstrapped data fail to have
a basic property of the original data, and the consequences of this for the
estimation of $\phi$ from $y^*_1, \ldots, y^*_T$ are an unresolved issue.
Furthermore, (4.5) is numerically unstable when $|\hat\phi| < 1$.

When the series $y_t$ is stationary, a second approach, which avoids the
difficulties just encountered, would seem to recommend itself. To illustrate
with the first order case again: if $y_t$ satisfying (4.1) is stationary, then
it is easy to verify that the random variables $a_t$ defined by

$a_t = y_t - \phi y_{t+1}$    (4.6)

are uncorrelated with one another, satisfy $E a_t^2 = E e_t^2$, and each $a_t$
is uncorrelated with $y_{t+j}$ for all $j \ge 1$. (This equation is sometimes
called the time-reversed representation of the process $y_t$.) We can therefore
use, as an estimate of $\phi$, the value $\hat\phi$ minimizing
$\sum_{t=1}^{T-1} \hat a_t^2$, where $\hat a_t = y_t - \hat\phi y_{t+1}$,
$t = 1, \ldots, T-1$, draw randomly with replacement from this set of residuals
(after centering about their sample mean) to obtain
$a^*_1, \ldots, a^*_{T-1}$ and, finally, define $y^*_T = y_T$ and

$y^*_t = \hat\phi y^*_{t+1} + a^*_t$

for $t = T-1, \ldots, 1$, thus generating a pseudo-data sample path containing
$y_T$. This procedure is appropriate only if the $a_t$ defined by (4.6) are
i.i.d., since this is a property of the $a^*_t$.

We will now show, however, that the white noise series $a_t$ can be independent
only if the cumulants of $y_t$ (or, equivalently, those of $e_t$) are those of a
Gaussian series, i.e., are 0 for orders higher than 2. Indeed, let $\kappa_r$
denote the r-th order cumulant $\mathrm{cum}(e_t, \ldots, e_t)$ of $e_t$ for
some $r > 2$ (assumed to exist). From (4.6), it is easy to see that the $a_t$'s
are independent if and only if $a_t$ is independent of $y_{t+j}$ for each
$j \ge 1$. In this case, the r-th order cumulants
$\mathrm{cum}(a_t, y_{t+j}, \ldots, y_{t+j})$ are 0; see Brillinger (1975,
p. 19) for the fundamental properties of cumulants. For $j = 1$, in particular,
since we can write

$y_{t+1} = e_{t+1} + \sum_{j=0}^{\infty} \phi^{j+1} e_{t-j}$

and

$a_t = y_t - \phi y_{t+1} = -\phi e_{t+1} + (1 - \phi^2) \sum_{j=0}^{\infty} \phi^j e_{t-j}$,

we are then led to

$0 = \mathrm{cum}(a_t, y_{t+1}, \ldots, y_{t+1})$
$\;\; = -\phi\, \mathrm{cum}(e_{t+1}, \ldots, e_{t+1}) + (1 - \phi^2) \sum_{j=0}^{\infty} \phi^{rj+r-1} \mathrm{cum}(e_{t-j}, \ldots, e_{t-j})$
$\;\; = \kappa_r [(\phi^{r-1} - \phi)/(1 - \phi^r)]$.

Since $0 < |\phi| < 1$, it follows that $\kappa_r = 0$, as asserted. If the
distribution of $e_t$ is determined by its moments and if all moments exist,
then $e_t$, and hence also $y_t$, is therefore Gaussian. For Gaussian time
series, however, pseudo-Gaussian Monte Carlo simulations seem like a more
natural device to use to generate sample paths than the bootstrap.

We conclude from the preceding discussion that generally satisfactory methods
are lacking for obtaining bootstrap sample paths through the final observations
$y_{T-p+1}, \ldots, y_T$.
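
For concreteness, the second (time-reversed) construction discussed above can be sketched as follows for the mean-zero first order case. This is a hypothetical illustration with assumed function and variable names; as the argument above shows, the construction is only justified when the series is Gaussian, so the sketch is not a recommended procedure.

```python
import random


def backward_bootstrap_path(y, seed=0):
    """Pseudo-data path ending at y_T, built from the time-reversed AR(1)."""
    rng = random.Random(seed)
    T = len(y)
    # Backward least squares: regress y_t on y_{t+1} without intercept.
    phi = (sum(y[t] * y[t + 1] for t in range(T - 1))
           / sum(v * v for v in y[1:]))
    a = [y[t] - phi * y[t + 1] for t in range(T - 1)]
    a_bar = sum(a) / len(a)
    a = [v - a_bar for v in a]              # centre the backward residuals
    path = [0.0] * T
    path[-1] = y[-1]                        # y*_T = y_T
    for t in range(T - 2, -1, -1):          # t = T-1, ..., 1 in the paper's indexing
        path[t] = phi * path[t + 1] + rng.choice(a)
    return phi, path
```

Every path generated this way passes through the final observation, which is the property sought in this subsection.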

Remark. The calculation used above, showing that assuming one-step forward and
backward prediction errors are i.i.d. is tantamount to assuming that the
observations are Gaussian, can be extended to stationary autoregressive
processes of arbitrary order. A much more general assertion is made in Result
?.2 of Donoho (1981), namely, that a strictly stationary non-Gaussian time
series with finite second moments can have (ignoring rescalings) at most one
invertible representation as a moving average of an i.i.d. white noise process.
Some important details are missing in the proof which is given there, however.


CONCLUSION 


Our results suggest that the estimates of mean square forecast error which
result from the bootstrap procedure proposed by Freedman and Peters are not
significantly more reliable than the large-sample estimates, which are
ill-behaved in small samples.

This  does  not  exclude  the  possibility 
that  other  methods  of  bootstrapping 
these  statistics  could  prove  useful. 

THE EM ALGORITHM IN TIME SERIES ANALYSIS


R. H. Shumway


Division of Statistics
University of California
Davis, CA 95616

The EM algorithm is ideally suited for maximizing likelihood functions arising in
time series models involving stochastic signals embedded in noise. Successive steps
involve simple regression computations, and the likelihood is nondecreasing at each
step. Furthermore, the algorithm provides a simple and natural approach to handling
problems caused by irregularly observed time series data. The simplicity of the
approach is illustrated by applying the EM algorithm to the problem of estimating
parameters in the state-space model. Examples involving biomedical data, economic
data and data collected from the soil sciences are presented to illustrate the
general procedure. A review is given of past experience in applying the algorithm,
using both minimally configured microcomputers and large-scale mainframes.


1. INTRODUCTION

One of the benefits resulting from the explosive growth of microcomputer
technology is that research workers now have easy access to computer programs
for applying some of the computer intensive methods of time series analysis. Two
examples are the Kalman filtering and smoothing recursions for the state-space
model and iterative methods for maximum likelihood estimation using
Newton-Raphson or EM algorithms.

A very general model which subsumes a whole class of special cases of interest
in much the same way that linear regression does is the state-space model
introduced in Kalman (1960) and Kalman and Bucy (1961). Although the model was
originally utilized in aerospace related research, it has recently been applied
to modeling data from economics (Harrison and Stevens (1976), Harvey and Pierse
(1984), Kitagawa (1981), Kitagawa and Gersch (1984), Shumway and Stoffer
(1982)), medicine (Jones (1964)) and the soil sciences (Shumway (1985)).

The general form of the multivariate state-space model involves assuming that
the r×1 observation vector $y_t = (y_{1t}, \ldots, y_{rt})'$ can be written in
the form

$y_t = A_t x_t + v_t$    (1.1)

for t = 1, 2, ..., T, where $A_t$ is an r×p design matrix which specifies how
the unobserved state vector $x_t = (x_{1t}, x_{2t}, \ldots, x_{pt})'$ can be
converted into the observation vector $y_t$ at any time point t. The additive
r×1 observation noises $v_t$ are assumed to be independent with $E v_t = 0$ and
covariance

$R = E(v_t v_t')$.    (1.2)

The form of (1.1) is almost identical to the standard regression model with
$x_t$ corresponding to a vector of random regression coefficients.

The behavior of the state vector $x_t$ is determined by its initial value $x_0$
and the state equations

$x_t = \Phi x_{t-1} + w_t$,    (1.3)

defined for t = 1, ..., T, where $\Phi$ is a p×p transition matrix and $w_t$ is
another independent model noise process with $E w_t = 0$ and p×p model noise
covariance matrix

$Q = E(w_t w_t')$.    (1.4)

This is, of course, closely related to the first order autoregressive model
defined previously, although no restrictions are imposed to guarantee
stationarity. The specification is completed by assuming that the initial vector
$x_0$ has mean $\mu$ and covariance matrix

$\Sigma = E(x_0 - \mu)(x_0 - \mu)'$.    (1.5)

An important feature of the multivariate state-space formulation is that it
provides one with a great flexibility in tailoring models to special
circumstances. For example, suppose that we observe

$y_t = x_t + v_t$

where the unobserved series $x_t$ is the second-order autoregressive process

$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + w_t$.

This autoregressive "signal plus noise" model can be easily put into the
state-space format (1.1) and (1.3) by writing

$y_t = (1, 0) \begin{pmatrix} x_t \\ x_{t-1} \end{pmatrix} + v_t$

where

$\begin{pmatrix} x_t \\ x_{t-1} \end{pmatrix} = \begin{pmatrix} \phi_1 & \phi_2 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} x_{t-1} \\ x_{t-2} \end{pmatrix} + \begin{pmatrix} w_t \\ 0 \end{pmatrix}$,

with the obvious identifications for $A_t$ and $\Phi$ in Equations (1.1) and
(1.3). Many different specific models can be expressed in state-space form as we
shall see in later sections.
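
The identification of $A_t$ and $\Phi$ for the AR(2) signal plus noise example can be made concrete with a short simulation. This is a sketch under assumptions that are not in the text: Gaussian noises, a zero starting state, and hypothetical function names.

```python
import random


def ar2_state_space(phi1, phi2):
    """Design row A and transition matrix Phi for the state (x_t, x_{t-1})'."""
    A = [1.0, 0.0]
    Phi = [[phi1, phi2],
           [1.0, 0.0]]
    return A, Phi


def simulate(phi1, phi2, sd_w, sd_v, n, seed=0):
    """Simulate y_t = x_t + v_t with x_t a Gaussian AR(2), via (1.1) and (1.3)."""
    rng = random.Random(seed)
    A, Phi = ar2_state_space(phi1, phi2)
    state = [0.0, 0.0]                      # assumed start: x_0 = x_{-1} = 0
    xs, ys = [], []
    for _ in range(n):
        w = rng.gauss(0.0, sd_w)
        # State equation (1.3): only the first component receives noise.
        state = [Phi[0][0] * state[0] + Phi[0][1] * state[1] + w,
                 Phi[1][0] * state[0] + Phi[1][1] * state[1]]
        xs.append(state[0])
        # Observation equation (1.1): y_t = A x_t + v_t.
        ys.append(A[0] * state[0] + A[1] * state[1] + rng.gauss(0.0, sd_v))
    return xs, ys
```

Setting the observation noise standard deviation to zero returns the signal itself, which is a quick check that the design row picks out the first state component.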

The introduction of the state-space approach as a tool for modeling data in the
social and biological sciences requires that one be able to handle the model
identification and parameter estimation problems since there will rarely be a
well defined differential equation describing the state transitions.
Furthermore, we would like to be able to handle general versions of (1.1) and
(1.3) which provide for the possibility of missing data which occurs so often in
the biological sciences. The problems of interest for the state-space model
relate to estimating the state vector $x_t$ and the unknown parameters $\mu$,
$\Sigma$, $\Phi$, Q and R. The problem of estimating $x_t$ recursively under the
assumption that the parameters are known was originally solved by Kalman (1960)
and Kalman and Bucy (1961) and is the celebrated Kalman filter.

2. FILTERING, SMOOTHING AND FORECASTING

The problem of estimating $x_t$ in the state-space model (1.1)-(1.5) can be
approached by noticing that the linear estimator with minimum mean square error
is the expectation conditioned on the observed data $y_1, \ldots, y_s$. In order
to specify this procedure, consider the general conditional mean

$x^s_t = E(x_t \mid y_1, \ldots, y_s)$,    (2.1)

defined as a function of t, the point at which we need the value, and the span,
s, of data which is used to determine the estimator. The general mean squared
covariance function of the estimator (2.1) will be denoted by

$P^s_{tu} = E[(x_t - x^s_t)(x_u - x^s_u)' \mid y_1, \ldots, y_s]$.    (2.2)

Several cases of interest can be distinguished depending on the span of the data
and the point t at which the estimator is desired. For example, the one-step
predictors $x^{t-1}_t$ are the Kalman filter estimators whereas the conditional
means $x^T_t$, based on the complete data span $y_1, \ldots, y_T$, are the
Kalman smoothed estimators. Forecasting can be defined as the computation of
$x^T_t$ for t > T.

The computation of the quantities in equations (2.1) and (2.2) is a formidable
undertaking if approached by straightforward methods. The dimensions of the
vectors specified by the model are at least rT × 1 or pT × 1 where T denotes the
number of data points observed in time. However, the recursions developed by
Kalman (1960) and Kalman and Bucy (1961) require only that matrix computations
of order r×r or p×p be performed recursively to develop the conditional means
and covariances. The process of finding the Kalman filter ($x^{t-1}_t$) and
smoother ($x^T_t$) estimators again involves using the linearity assumption to
determine the minimizers of the mean square errors $P^s_{tt}$. The derivation
requires using the projection theorem recursively in conjunction with the model
equations (1.1) and (1.3). The reader is referred to Jazwinski (1970) or
Anderson and Moore (1979) for details.

The calculation of the Kalman filter estimators proceeds by the so-called
forward recursions



$x^{t-1}_t = \Phi x^{t-1}_{t-1}$,    (2.3)

$x^t_t = x^{t-1}_t + K_t (y_t - A_t x^{t-1}_t)$    (2.4)

for t = 1, ..., T with $x^0_0 = \mu$. The one-step forecast $x^{t-1}_t$ is a
strict update of the previous estimated value whereas the best estimator
involving current data is a weighted average of $x^{t-1}_t$ and the error that
one makes in predicting $y_t$. The p×r weight or gain matrix $K_t$ is defined as

$K_t = P^{t-1}_{tt} A_t' (A_t P^{t-1}_{tt} A_t' + R)^{-1}$,    (2.5)

where the covariances are updated recursively using the recursions

$P^{t-1}_{tt} = \Phi P^{t-1}_{t-1,t-1} \Phi' + Q$    (2.6)

and

$P^t_{tt} = P^{t-1}_{tt} - K_t A_t P^{t-1}_{tt}$.    (2.7)
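
The forward recursions (2.3)-(2.7) can be sketched for the scalar case r = p = 1, where every matrix product becomes an ordinary product. This is an illustrative sketch only; the function name and argument names are assumptions, the design value $A_t$ is taken to be constant, and the starting values are $x^0_0 = \mu$ and $P^0_{00} = \Sigma$.

```python
def kalman_filter_scalar(y, phi, state_var, obs_var, mu, sigma, a=1.0):
    """Forward recursions (2.3)-(2.7) for r = p = 1 and a constant A_t = a.

    state_var plays the role of Q and obs_var the role of R.  Returns the
    one-step predictions x_t^{t-1}, the filtered values x_t^t and the
    filtered variances P_tt^t for t = 1, ..., T.
    """
    x_filt, p_filt = mu, sigma              # x_0^0 = mu, P_00^0 = Sigma
    preds, filts, variances = [], [], []
    for obs in y:
        x_pred = phi * x_filt                               # (2.3)
        p_pred = phi * p_filt * phi + state_var             # (2.6)
        gain = p_pred * a / (a * p_pred * a + obs_var)      # (2.5)
        x_filt = x_pred + gain * (obs - a * x_pred)         # (2.4)
        p_filt = p_pred - gain * a * p_pred                 # (2.7)
        preds.append(x_pred)
        filts.append(x_filt)
        variances.append(p_filt)
    return preds, filts, variances
```

With no observation noise the gain equals one and the filtered value reproduces each observation, which gives a simple sanity check on the recursions.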


t-1 


In 


conditional  on  where  ‘  Is  defined 
(2.1).  The  Innovations,  conditional  on 
Zl> l'>ve  zero  neans  and  covariance 


'’tt 


A.  +  R 


(3.2) 


,\. 

•'u' 


with  P§o 


E. 


If  the  estimator  for  Xj^  Is  to  be  based  on  all  of 

the  data  ^1 . .  »®  need  the  Kalman  smoother 

estimators.  These  can  be  developed  by  solving 
successively  the  backward  recursions  for 
t-T,T-l . 1  using  the  equations 


it-1 


t-1  '  T 

££-1  +  - 


where 


Jt-i  -  pJ:i.t-i*'(p«^)'^ 


(2.8) 


(2.9) 


The  mean  square  error  covariance  for  the 
smoothed  estimator  satisfies  the  recursions 

Pt-l,t-l  -  Pt-l.t-l+Jt-lCPlt-f^M-It-l  (2.10) 

If  a  forecast  Is  needed  It  la  clear  that  one 
only  needs  to  extend  the  forward  recursions 
(2.3)-(2.7)  Into  the  future  under  Che  convention 
that  Kt«0  In  (2.4)  and  (2.7). 

The  Kalman  filter  and  smoother  recursions  give  a 
convenient  means  for  calculating  the  conditional 
expectations  which  are  of  greatest  Interest  In 
solving  problems  In  smoothing  and  forecasting 
for  time  series.  The  data  are  not  required  Co 
be  regularly  spaced  so  that  Che  smoothed 
estimators  xj  can  be  used  In  lieu  of  missing 
values  (see  Section  3).  The  main  problem  which 
remains,  however.  Is  In  specifying  values  for 
the  unknown  parameters  £,  $,  Q  and  R  which 
are  needed  In  order  to  apply  the  recursions. 


The  log  likelihood  for  estimating  Che  parameter 
0  -  (®,Q,R)  is  essentially 

T  T 

logUYje)  8  -  y  £  losUJ-  y  E  (3-3) 

z  T.1  t  z  t-l^  *  ^ 

which  Is  a  highly  nonlinear  function  of  the 
unknown  parameters.  The  usual  procedure  Is  to 
fix  2^  and  then  develop  a  set  of  recursions  for 
the  log  likelihood  function  and  its  first  two 
derivatives.  Then,  a  Newton-Raphson  algorithm 
can  be  used  to  successively,  update  the  parameter 
values  until  the  log  likelihood  (3.3)  Is 
maximized.  This  approach  Is  advocated,  for 
example,  by  Gupta  and  Hehra  (1974),  Ansley  and 
Kohn  (1984),  or  Jones  (1980). 

We give a simpler approach here, based on the EM
or expectation-maximization algorithm of
Dempster et al (1977). The EM algorithm was
adapted to this time series model in Shumway and
Stoffer (1982). The EM algorithm proceeds by
successively maximizing the current conditional
expectation of the complete (but unobserved)
data log likelihood based on
$X = (x_0, w_1, \ldots, w_T, v_1, \ldots, v_T)$
conditional on the incomplete (but
observed) data $Y = (y_1, \ldots, y_T)$. This
complete-data log likelihood, given in Shumway and
Stoffer (1982), involves the parameters
$\theta = (\mu, \Sigma, \Phi, Q, R)$ in a convenient form but cannot
be maximized directly since the $x_t$ process is
not observed. However, if the current value
of $\theta$ is $\theta_i$ and $E_i$ denotes the expectation under
$\theta_i$, the EM algorithm proceeds by maximizing


$$Q(\theta \mid \theta_i) = E_i\left[\log L(X,\theta) \mid Y\right] \tag{3.4}$$


3.  ESTIMATION  OF  PARAMETERS 

The estimation of the parameters involved in
specifying the state-space model (1.1)-(1.5) can
be accomplished using maximum likelihood if we
are willing to assume that
$x_0$, $w_t$ and $v_t$ are jointly normal and uncorrelated
random vectors.

The usual likelihood is the "innovations" form
of Schweppe (1968), which involves writing the
joint likelihood of the innovations

$$\epsilon_t = y_t - A_t x_t^{t-1}. \tag{3.1}$$


at each step. Equation (3.4) can be written in
terms of the Kalman smoothed outputs. The
maximization of the resulting function with respect
to the parameters $\Phi$, $Q$ and $R$ then is exactly
analogous to maximizing the usual multivariate
normal likelihood function and yields the
regression estimators


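$$\Phi(i+1) = S_t(1)\,[S_{t-1}(0)]^{-1}, \tag{3.5}$$

$$Q(i+1) = T^{-1}\left\{S_t(0) - S_t(1)\,[S_{t-1}(0)]^{-1}S_t(1)'\right\}, \tag{3.6}$$

where

$$S_t(j) = \sum_{t=1}^{T}\left(P_{t,t-j}^{T} + x_t^{T}x_{t-j}^{T\prime}\right) \tag{3.7}$$

for $j = 0, 1$, and

$$R(i+1) = T^{-1}\sum_{t=1}^{T}\left(\epsilon_t^{T}\epsilon_t^{T\prime} + A_t P_{t,t}^{T} A_t'\right), \tag{3.8}$$

with

$$\epsilon_t^{T} = y_t - A_t x_t^{T}. \tag{3.9}$$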

The term involving $\mu$ and $\Sigma$ has only a single
observation and we arbitrarily fix $\Sigma$ and take

$$\mu(i+1) = x_0^{T}. \tag{3.10}$$

The Kalman smoother can be used to compute all
the terms in (3.7) except $S_t(1)$, which involves

$$P_{t,t-1}^{T} = \mathrm{cov}(x_t, x_{t-1} \mid y_1, \ldots, y_T).$$

Shumway and Stoffer (1982) have given the following
backward recursions for determining $P_{t,t-1}^{T}$
for $t = T, T-1, \ldots, 2$. The basic recursion uses

$$P_{t-1,t-2}^{T} = P_{t-1,t-1}^{t-1}J_{t-2}' + J_{t-1}\left(P_{t,t-1}^{T} - \Phi P_{t-1,t-1}^{t-1}\right)J_{t-2}',$$

where we start with

$$P_{T,T-1}^{T} = (I - K_T A_T)\,\Phi P_{T-1,T-1}^{T-1}. \tag{3.13}$$

The overall procedure can be regarded as simply
alternating between the Kalman filtering and
smoothing recursions and the multivariate normal
maximum likelihood equations (3.5)-(3.10). We
summarize the iterative procedure as follows:

1.  Initialize $\mu_0$, $\Phi_0$, $Q_0$, $R_0$ and $\Sigma$.

2.  Use the Kalman recursions (2.3)-(2.9) to
calculate $x_t^T$, $P_{t,t}^T$ and $P_{t,t-1}^T$.

3.  Evaluate the log likelihood (3.3).

4.  Update parameters to $\mu_1$, $\Phi_1$, $Q_1$, $R_1$ using
Equations (3.5)-(3.10).

5.  Return to step 2.
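The five steps above can be sketched for the simplest possible case. The following is a minimal illustration, not the authors' BASIC or FORTRAN program: it assumes a scalar state and a scalar observation with a known design coefficient `a`, keeps the initial variance `sigma` fixed as the text suggests, and uses hypothetical function names (`kalman_smooth`, `em_fit`, `simulate`) introduced only for this example.

```python
import math
import random


def kalman_smooth(y, a, phi, q, r, mu, sigma):
    """Scalar filter, smoother and lag-one covariance smoother."""
    T = len(y)
    xp, Pp = [0.0] * (T + 1), [0.0] * (T + 1)  # one-step predictions
    xf, Pf = [0.0] * (T + 1), [0.0] * (T + 1)  # filtered values
    xf[0], Pf[0] = mu, sigma
    loglik, gain = 0.0, 0.0
    for t in range(1, T + 1):  # forward recursions
        xp[t] = phi * xf[t - 1]
        Pp[t] = phi * Pf[t - 1] * phi + q
        innov = y[t - 1] - a * xp[t]
        s = a * Pp[t] * a + r  # innovation variance
        gain = Pp[t] * a / s
        xf[t] = xp[t] + gain * innov
        Pf[t] = (1.0 - gain * a) * Pp[t]
        loglik -= 0.5 * (math.log(2.0 * math.pi * s) + innov ** 2 / s)
    xs, Ps, J = xf[:], Pf[:], [0.0] * T
    for t in range(T, 0, -1):  # backward smoothing recursions
        J[t - 1] = Pf[t - 1] * phi / Pp[t]
        xs[t - 1] = xf[t - 1] + J[t - 1] * (xs[t] - xp[t])
        Ps[t - 1] = Pf[t - 1] + J[t - 1] ** 2 * (Ps[t] - Pp[t])
    Pcs = [0.0] * (T + 1)  # Pcs[t] = cov(x_t, x_{t-1} | all data)
    Pcs[T] = (1.0 - gain * a) * phi * Pf[T - 1]
    for t in range(T, 1, -1):  # lag-one covariance recursion
        Pcs[t - 1] = (Pf[t - 1] * J[t - 2]
                      + J[t - 1] * (Pcs[t] - phi * Pf[t - 1]) * J[t - 2])
    return xs, Ps, Pcs, loglik


def em_fit(y, a=1.0, phi=0.5, q=1.0, r=1.0, mu=0.0, sigma=1.0, n_iter=20):
    """Alternate smoothing with closed-form updates; sigma stays fixed."""
    T = len(y)
    history = []
    for _ in range(n_iter):
        xs, Ps, Pcs, loglik = kalman_smooth(y, a, phi, q, r, mu, sigma)
        history.append(loglik)  # step 3: log likelihood at current values
        s11 = sum(xs[t] ** 2 + Ps[t] for t in range(1, T + 1))
        s10 = sum(xs[t] * xs[t - 1] + Pcs[t] for t in range(1, T + 1))
        s00 = sum(xs[t - 1] ** 2 + Ps[t - 1] for t in range(1, T + 1))
        phi = s10 / s00                      # transition update
        q = (s11 - s10 ** 2 / s00) / T       # state noise variance update
        r = sum((y[t - 1] - a * xs[t]) ** 2 + a * Ps[t] * a
                for t in range(1, T + 1)) / T  # observation noise update
        mu = xs[0]                           # initial mean update
    return {"phi": phi, "q": q, "r": r, "mu": mu}, history


def simulate(n, phi=0.8, sd_w=1.0, sd_v=0.5, seed=0):
    """Synthetic AR(1)-plus-noise data for trying the sketch."""
    rng = random.Random(seed)
    x, y = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, sd_w)
        y.append(x + rng.gauss(0.0, sd_v))
    return y
```

Calling `em_fit(simulate(200), n_iter=15)` returns the final estimates together with the log likelihood recorded at each pass; that sequence should never decrease, which is a convenient check on any implementation of the recursions.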

One of the advantages of the EM algorithm
results from the simplicity of standard multivariate
normal calculations which depend only on
output from the forward and backward Kalman
recursions. Successive steps of the form (3.4)
never decrease the likelihood function and one
is guaranteed to converge to at least a local
maximum of the log likelihood function under
fairly mild regularity conditions (see Wu
(1984)). While the convergence rate of the EM
algorithm is somewhat slower than that possible
with Newton-Raphson or scoring algorithms (in
the neighborhood of the maximum), one may be
able to avoid the large divergent step corrections
which are characteristic of these latter
two procedures in the multiparameter situation.

An attractive feature available within the
state-space framework relates to the ability to
treat series which have been observed irregularly
over time. The EM algorithm allows one to
have parts of the observation vector $y_t$ missing
at a number of observation times without invalidating
the computational procedures described in
the previous two sections. An especially simple
procedure results for the special case where the
unobserved and observed parts of the error
vector $v_t$ are uncorrelated.


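Suppose that at a given step, we define the partition
of the $r \times 1$ observation vector
$y_t = (y_t^{(1)\prime}, y_t^{(2)\prime})'$, where $y_t^{(1)}$ is the $r_1 \times 1$
observed portion and $y_t^{(2)}$ is the $r_2 \times 1$
unobserved portion, leading to the partitioned
form

$$\begin{pmatrix} y_t^{(1)} \\ y_t^{(2)} \end{pmatrix} = \begin{pmatrix} A_t^{(1)} \\ A_t^{(2)} \end{pmatrix} x_t + \begin{pmatrix} v_t^{(1)} \\ v_t^{(2)} \end{pmatrix},$$

where $A_t^{(1)}$ and $A_t^{(2)}$ are $r_1 \times p$ and $r_2 \times p$
matrices and

$$\mathrm{cov}\begin{pmatrix} v_t^{(1)} \\ v_t^{(2)} \end{pmatrix} = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}.$$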

Stoffer (1982) established that Equations (2.3)-
(2.10) hold for the missing data case given
above if one makes the replacements
$y_t = (y_t^{(1)\prime}, 0')'$, $A_t = (A_t^{(1)\prime}, 0')'$, and $R_{12} = R_{21} = 0$.
That is, if $y_t$ is incomplete, the
filtered and smoothed estimators can be calculated
from the usual equations by entering
zeroes in the observation vector where data
are missing and by zeroing out the corresponding
rows of the design matrix $A_t$. This leads to the
smoothed estimators $x_t^{T}$ and the covariance
functions $P_{t,t}^{T}$ and $P_{t,t-1}^{T}$ for the missing data case.
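The zeroing device can be illustrated in the scalar case, where it is easy to see why it works: with the observation and the design coefficient both set to zero, the gain vanishes and the filtered value equals the prediction. This is only a sketch under that scalar assumption; `filter_step` and `filter_with_gaps` are hypothetical names, and `None` is used here as the marker for a missing value.

```python
def filter_step(x_prev, p_prev, y, a, phi, q, r):
    """One scalar Kalman step; returns (predicted, filtered) pairs."""
    x_pred = phi * x_prev
    p_pred = phi * p_prev * phi + q
    gain = p_pred * a / (a * p_pred * a + r)
    x_filt = x_pred + gain * (y - a * x_pred)
    p_filt = (1.0 - gain * a) * p_pred
    return (x_pred, p_pred), (x_filt, p_filt)


def filter_with_gaps(ys, a, phi, q, r, mu, sigma):
    """Run the filter over a series in which None marks a missing value."""
    x, p = mu, sigma
    out = []
    for y in ys:
        if y is None:
            # Missing: enter zero for the observation and for the design
            # coefficient, then use the usual equations unchanged.
            _, (x, p) = filter_step(x, p, 0.0, 0.0, phi, q, r)
        else:
            _, (x, p) = filter_step(x, p, y, a, phi, q, r)
        out.append((x, p))
    return out
```

For example, `filter_with_gaps([1.0, None, None], 1.0, 0.9, 0.2, 0.3, 0.0, 1.0)` carries the state estimate across the two gaps by prediction alone, and the reported variance grows at each missing time.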


The maximum likelihood estimators, as computed
in the EM procedure, require that one take the
conditional expectation of (3.4) under the assumption
that $y_t$ is incompletely observed. Now,
defining the incomplete data as
$Y^{(1)} = (y_1^{(1)}, \ldots, y_T^{(1)})$, the expectation of
the third term can be computed by conditioning
first on both $y_t^{(1)}$ and $x_t$, then on $Y^{(1)}$, which
leads to (cf. Shumway and Stoffer (1982),
Shumway (1984))

$$R(i+1) = T^{-1}\sum_{t=1}^{T} D_t G_t D_t' \tag{3.16}$$


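where

$$G_t = \begin{pmatrix} C_t^{(1)} & C_t^{(1)}F' \\ F C_t^{(1)} & F C_t^{(1)} F' + R_{22.1} \end{pmatrix} \tag{3.17}$$

with

$$F = R_{21}R_{11}^{-1}, \tag{3.18}$$

$$R_{22.1} = R_{22} - R_{21}R_{11}^{-1}R_{12} \tag{3.19}$$

and

$$C_t^{(1)} = \epsilon_t^{(1)}\epsilon_t^{(1)\prime} + A_t^{(1)}P_{t,t}^{T}A_t^{(1)\prime}, \tag{3.20}$$

where

$$\epsilon_t^{(1)} = y_t^{(1)} - A_t^{(1)}x_t^{T}. \tag{3.21}$$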


The matrix $D_t$ is a permutation matrix which
reorders the variables in their original form.
This is necessary because the application of
(3.17)-(3.20) requires that the variables be
ordered so that the observed values appear in


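A simplification introduced in Shumway and
Stoffer (1982) is to assume that the errors
relating the unobserved and observed components
are uncorrelated, i.e. $R_{12} = 0$, so that the
correction (3.17) reduces to

$$G_t = \begin{pmatrix} C_t^{(1)} & 0 \\ 0 & R_{22} \end{pmatrix}, \tag{3.22}$$

where $C_t^{(1)}$ is the observed-data term defined in (3.20).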


the  previous  Iterate. 

4.  EXAMPLES

4.1  An Irregularly Observed Biomedical Series

In order to give an illustration of an incomplete
series, consider the problem of modeling
the level of several biomedical parameters
monitored after a cancer patient undergoes a
bone marrow transplant. The data in Figure 1,
presented by Jones (1984), are measurements made
for 92 days on the three variables log(white
blood count), log(platelet) and HCT (hematocrit).
Approximately 40% of the values are missing,
with the missing values mainly occurring after
the 35th day. (The missing values are shown
along the time axis on the plotted series.) The
main objectives in this example are to model the
three variables using the state-space approach
and to smooth the data. According to Jones
(1984), "Platelet count at about 100 days post
transplant has previously been shown to be a
good indicator of subsequent long term
survival."
Figure 1 - Bone marrow transplant data (Jones
(1984)).


If  the  vector  observation  has  all  components 
missing,  the  correction  reduces  to  adding  R  from 


The simple state-space model with three components
was chosen with the observed log(WBC),
log(platelet) and HCT denoted by $y_{1t}$, $y_{2t}$
and $y_{3t}$ and the unknown true levels denoted by
$x_{1t}$, $x_{2t}$ and $x_{3t}$. The true vector process
satisfies the state equation


4.2  Signal  Extraction  for  Soil  Sciences  Data 


*1? 

!  .981 
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*i,t-r 

“U 
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- 
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*2,t-l 

+ 

*2t 

I53t, 

\-l,078 

1.811 

.823/ 

f3,t-l. 

-3t. 

As an example of a simple signal extraction
problem consider the following example from
Shumway (1985) involving salt content values
measured at intervals of one meter over a line
transect. Figure 3 shows the average of five
such transects (parallel samples) taken from
Morkoc et al (1984).


where the transition matrix was estimated after
30 iterations of the EM algorithm. The state
and observation covariance matrices were estimated
as


^.014 
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.01 3\ 

/'.007 

0 

o\ 
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.003 
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.027 
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1 

0 

.017 
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\  .013 

.027 

3.485/ 

1  ° 

0 
Again, the coupling between the first two series
and the third series is relatively weak. The
regression relating $x_{3t}$ (HCT) to the other two
series seems to be fairly strong, i.e.

$$x_{3t} = -1.078\,x_{1,t-1} + 1.811\,x_{2,t-1} + .823\,x_{3,t-1} + w_{3t}$$

The smoothed values, as evaluated using the
Kalman recursions, are shown in Figure 2 below.

The approximate standard errors of the
interpolated missing values in the latter parts
of the series are in the ranges .11-.11, .09-.09
and 1.7-2.0 for the three series respectively.
Figure 2 - Smoothed bone marrow transplant data

Figure 3 - Average salt content over five
transects (1 pt = 1 m). (Morkoc
et al (1984)).


It is plausible that the salt content can be
represented as a non-stationary trend function
superimposed on noise. We might assume (see
Shumway (1985)) that the observed salt content
at the spatial point $s$, say $y_s$, can be
represented as

$$y_s = x_s + v_s \tag{4.1}$$

where $x_s$ is the smooth trend function and $v_s$ is
the irregular white noise component with
variance $\sigma_v^2$. The basic objective is to produce
an estimator for the nonstationary trend
function $x_s$. In order to specify smoothness
constraints for the trend function $x_s$ we might
assume that the second difference (derivative)
is small, say


$$\nabla^2 x_s = w_{1s} \tag{4.2}$$

where $\nabla$ is the usual difference operator and $w_{1s}$
is a noise with variance $\sigma_w^2$. There is an obvious
similarity here to spline smoothing (see
Wecker and Ansley (1984)). Now, since

$$\nabla^2 x_s = x_s - 2x_{s-1} + x_{s-2}, \tag{4.3}$$


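It is clear that by defining the state vector
$\mathbf{x}_s = (x_s, x_{s-1})'$, the model in Equations (4.1)
and (4.2) can be written in the state-space form

$$y_s = (1, 0)\begin{pmatrix} x_s \\ x_{s-1} \end{pmatrix} + v_s, \tag{4.4}$$

$$\begin{pmatrix} x_s \\ x_{s-1} \end{pmatrix} = \begin{pmatrix} 2 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} x_{s-1} \\ x_{s-2} \end{pmatrix} + \begin{pmatrix} w_{1s} \\ 0 \end{pmatrix}, \tag{4.5}$$

and the obvious identifications can be made in
(1.1) and (1.3). The transition matrix $\Phi$ is
fixed in this case and we have only to estimate
the variances $\sigma_v^2$ and $\sigma_w^2$ associated with the
observation and model noises respectively. The
estimator for $\sigma_w^2$ comes from $q_{11}$ in

$$Q = T^{-1}\left(S_t(0) - S_t(1)\Phi' - \Phi S_t(1)' + \Phi S_{t-1}(0)\Phi'\right) \tag{4.6}$$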


where $\Phi$ is the fixed transition matrix. The
estimator for $\sigma_v^2$ follows directly from (3.8) as
usual; the final estimators for the variances
are $\hat\sigma_v^2 = .102$, $\hat\sigma_w^2 = .021$.

The smoothed values under this model are
plotted in Figure 4 and it is clear that the
smoothed values follow the major turns in the
data quite well. The resulting smoothed series
has a prediction standard error of .16.
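The link between the second-difference constraint (4.2)-(4.3) and a two-component state vector can be checked numerically. The sketch below is an illustration only: it assumes the fixed transition matrix with rows (2, -1) and (1, 0) and the observation row (1, 0) implied by (4.3), and the noise standard deviations passed in are arbitrary illustrative values, not the estimates quoted in the text. `simulate_trend` is a hypothetical helper name.

```python
import random

PHI = [[2.0, -1.0], [1.0, 0.0]]  # fixed transition matrix implied by (4.3)
A = [1.0, 0.0]                   # observation row: y_s = x_s + v_s


def simulate_trend(n, sd_w, sd_v, seed=0):
    """Simulate the trend-plus-noise model through its state-space form."""
    rng = random.Random(seed)
    state = [0.0, 0.0]  # (x_s, x_{s-1})
    xs, ws, ys = [], [], []
    for _ in range(n):
        w = rng.gauss(0.0, sd_w)
        new_x = PHI[0][0] * state[0] + PHI[0][1] * state[1] + w
        old_x = PHI[1][0] * state[0] + PHI[1][1] * state[1]
        state = [new_x, old_x]
        xs.append(state[0])
        ws.append(w)
        ys.append(A[0] * state[0] + A[1] * state[1] + rng.gauss(0.0, sd_v))
    return xs, ws, ys
```

With `xs, ws, ys = simulate_trend(60, 0.15, 0.3)`, the quantity `xs[s] - 2 * xs[s - 1] + xs[s - 2]` reproduces `ws[s]` for every `s >= 2`, which is exactly the smoothness constraint written as a state equation.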
Figure 4 - Smoothed salt content using (4.1) and
(4.2) with $\hat\sigma_v^2 = .102$, $\hat\sigma_w^2 = .021$.


4.3  Forecasting and Seasonal Adjustment of
Economic Series


The inherent flexibility of the state-space
model can be exploited for developing additive
models for economic time series. The use of
state-space methods for analyzing additive
models of importance in economics has been
proposed by Kitagawa (1981), Kitagawa and Gersch
(1984) and Harvey (1983). As an example, consider
the quarterly data on earnings-per-share
shown in Figure 5 for the U.S. company, Johnson
and Johnson. The general character of the
series seems to emerge as an exponential trend
with a seasonal kind of oscillation superimposed
on this trend; the seasonal oscillation tends to
repeat every four quarters.
Figure 5 - Quarterly earnings per share (1970(4)
to 1980(1)) and 7 quarter forecast
for Johnson and Johnson.


In order to develop an additive model for this
particular kind of data, suppose that we regard
the observed series $y_t$ as being composed of
trend, seasonal and irregular components,
denoted by $x_{1t}$, $x_{2t}$ and $v_t$ respectively. The
observed data can be modeled as

$$y_t = x_{1t} + x_{2t} + v_t, \tag{4.7}$$

where the exponential trend component might be
modeled as

$$x_{1t} = \phi\,x_{1,t-1} + w_{1t}, \tag{4.8}$$

where $\phi > 1$ represents the growth rate. The
quarterly seasonal component might be modeled as

$$x_{2t} = -x_{2,t-1} - x_{2,t-2} - x_{2,t-3} + w_{2t}, \tag{4.9}$$

reflecting the fact that the sum of the four
quarters should be approximately 0 for the
seasonal factor. The problems of interest for
the model can be reduced first to estimating the
parameters and then the unobserved components
$x_{1t}$ and $x_{2t}$. One would also like to be able to
forecast $y_t$. A problem of some interest in
economic applications is in estimating the
series with seasonal effects excluded, i.e.
$(x_{1t} + v_t)$, sometimes termed seasonal
adjustment.


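The model specified by (4.7), (4.8) and (4.9)
can be put into state-space form by defining the
state vector $\mathbf{x}_t = (x_{1t}, x_{2t}, x_{2,t-1}, x_{2,t-2})'$, so
that the observation Equation (1.1) becomes

$$y_t = (1, 1, 0, 0)\begin{pmatrix} x_{1t} \\ x_{2t} \\ x_{2,t-1} \\ x_{2,t-2} \end{pmatrix} + v_t \tag{4.10}$$

with the state Equation (1.3) given by

$$\begin{pmatrix} x_{1t} \\ x_{2t} \\ x_{2,t-1} \\ x_{2,t-2} \end{pmatrix} = \begin{pmatrix} \phi & 0 & 0 & 0 \\ 0 & -1 & -1 & -1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}\begin{pmatrix} x_{1,t-1} \\ x_{2,t-1} \\ x_{2,t-2} \\ x_{2,t-3} \end{pmatrix} + \begin{pmatrix} w_{1t} \\ w_{2t} \\ 0 \\ 0 \end{pmatrix} \tag{4.11}$$

where

$$R = r_{11}, \qquad Q = \begin{pmatrix} q_{11} & 0 & 0 & 0 \\ 0 & q_{22} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \tag{4.12}$$

gives the two covariance structures. Harvey
(1981, p. 180) shows that this model with $\phi = 1$
is essentially an ARIMA $(0,1,1) \times (0,1,1)_4$ which
has been applied to accounting data by Griffin
(1977).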
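The state equation for the trend-plus-seasonal model is easy to exercise directly. The following sketch builds the 4 x 4 transition matrix of (4.11) and advances a state by one quarter; the helper names `transition`, `step` and `observe` and the starting values are hypothetical, chosen only to show that the trend is multiplied by the growth parameter while four consecutive seasonal terms sum to zero when the noise is switched off.

```python
def transition(phi):
    """Transition matrix for the state (x1_t, x2_t, x2_{t-1}, x2_{t-2})."""
    return [[phi, 0.0, 0.0, 0.0],
            [0.0, -1.0, -1.0, -1.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]


def step(matrix, state, noise=(0.0, 0.0, 0.0, 0.0)):
    """Advance the state one quarter: matrix times state plus noise."""
    return [sum(m * s for m, s in zip(row, state)) + e
            for row, e in zip(matrix, noise)]


def observe(state):
    """Noise-free observation, using the design row (1, 1, 0, 0)."""
    return state[0] + state[1]
```

Starting from `state = [1.0, 0.3, -0.1, -0.4]`, the call `step(transition(1.037), state)` gives a trend of 1.037 and a new seasonal term of about 0.2, so that the new seasonal term and the three before it (0.3, -0.1, -0.4) cancel.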

The computational modifications required for
this state-space model are minor since $q_{11}$ and
$q_{22}$ can now be obtained as the first two
diagonal elements in $Q$ defined by (4.6). The
estimated transition parameter $\phi$ is just the
ratio of the upper left corner elements of $S_t(1)$
and $S_{t-1}(0)$. That is,

$$\hat\phi = \frac{[S_t(1)]_{11}}{[S_{t-1}(0)]_{11}},$$

where $[A]_{ij}$ denotes the $ij$th element of the
matrix $A$.

Table 1 shows the successive estimators for the
four parameters as applied to the Johnson &
Johnson data.


Table 1 - Successive parameter estimates for
earnings-per-share for Johnson &
Johnson using additive model


Iter     φ       q11     q22     r11     2 log L

  1    1.028    .010    .010    .033    -93.96
  2    1.036    .012    .029    .062     -5.31
  3    1.037    .012    .047    .068      3.55
  4    1.037    .011    .061    .066      6.26
  5    1.037    .011    .072    .062      7.34
  6    1.037    .010    .080    .057      7.85
  7    1.037    .010    .085    .054      8.13
  8    1.037    .010    .088    .051      8.30
  9    1.037    .010    .090    .048      8.42
 10    1.037    .010    .092    .046      8.50
 11    1.037    .010    .097    .038      8.74
 12    1.037    .010    .096    .037      8.77
 13    1.037    .010    .096    .036      8.78
 14    1.037    .010    .096    .035      8.80

The log likelihood converges nicely to a local
maximum although, at the tenth iteration, the
process was stopped and the seasonal and
irregular component variances were incremented
strongly in the directions that were suggested
on examination of previous iterations. The
final value of 1.037 for the parameter $\phi$ implies
that the exponential growth rate is
approximately 3.7 percent per quarter.

The values of the parameters given in Table 1
were then used to estimate the trend $x_{1t}^T$ and
seasonal components $x_{2t}^T$ of the model. These are
shown in Figure 6 and we note that the estimated
"trend plus seasonal," say $x_{1t}^T + x_{2t}^T$, produces a
credible version of the original series. The
estimated trend might be taken as a seasonally
adjusted version of the series.
Figure 6 - Estimated trend, $x_{1t}^T$, and trend plus
seasonal, $x_{1t}^T + x_{2t}^T$, for the earnings data.


A fundamental question of interest here would be
in producing forecasts for the series, say

$$y_t^T = x_{1t}^T + x_{2t}^T$$

for $t > T$. It is clear that adding the Kalman
smoother outputs for the first two components of
$\mathbf{x}_t^T$ will generate these forecasts and that the
mean square error for the forecasts can be computed
as

$$[P_{t,t}^T]_{11} + 2[P_{t,t}^T]_{12} + [P_{t,t}^T]_{22},$$

where $[P_{t,t}^T]_{ij}$ denotes the $ij$th element of $P_{t,t}^T$.
Table 2 shows a three-quarter forecast for the
second through fourth quarters of 1980 compared
with the actual values. There seems to be quite

 15    1.037    .010    .096    .035      8.80


good agreement between the observed and predicted
values and all three prediction intervals
include the true values.


Table 2 - Comparison of observed earnings and
forecasts for Johnson & Johnson


Qtr        Obsvd    Frcst    *Error    approx 95% PI

1980(2)    14.67    14.97     .02      13.97-15.97
1980(3)    16.02    16.77     .05      15.75-17.79
1980(4)    11.61    12.24     .05      11.22-13.26

*Error = |observed - forecast| / observed
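The error column can be reproduced from the tabulated values. This short check simply applies the footnote's definition to the three rows of Table 2; the variable and function names are illustrative.

```python
# (quarter, observed, forecast, interval lower, interval upper) from Table 2
ROWS = [
    ("1980(2)", 14.67, 14.97, 13.97, 15.97),
    ("1980(3)", 16.02, 16.77, 15.75, 17.79),
    ("1980(4)", 11.61, 12.24, 11.22, 13.26),
]


def relative_error(observed, forecast):
    """Absolute forecast error as a fraction of the observed value."""
    return abs(observed - forecast) / observed


ERRORS = [round(relative_error(obs, fc), 2) for _, obs, fc, _, _ in ROWS]
COVERED = [lo <= obs <= hi for _, obs, _, lo, hi in ROWS]
```

Rounded to two decimals the errors come out as .02, .05 and .05, and each observed value lies inside its interval, in agreement with the table and the accompanying remark.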


The seven quarter forecasts are appended to the
original observed series in Figure 5 and can be
seen to provide a very plausible forecast of the
underlying earnings series.

5.  DISCUSSION

The application of the state-space approach to
modeling time series data in economics and in
the biological and physical sciences has been
hindered by the lack of accessible computing
power and software. Although the model is
inherently appealing, the process of developing
software for the computationally intensive
recursive and iterative procedures for smoothing
and for parameter estimation has been slow and
painful. This paper has described a proposed
procedure by which the Kalman filter and EM
algorithm can be combined to solve simultaneously
the problems of smoothing, forecasting and
parameter estimation.

The software for performing these computations
is available in BASIC for microcomputers and in
FORTRAN for large scale mainframes. The BASIC
version for microcomputers is currently available
and running on a Tandy 1200HD (or IBM PC,
PC-XT) using MSDOS. For a sample of 61
observations from three time series, each iteration
required 14 minutes. An earlier version
running on a TRS-80, Model III required 34
minutes per iteration. The availability of
FORTRAN and BASIC compilers combined with an
8086 chip should reduce these running times
significantly. A version in FORTRAN written for
the CDC-6600 required only one minute per
iteration for a sample involving 600
observations from each of seven time series. A
listing of a FORTRAN program which uses a quasi
Newton-Raphson algorithm instead of the EM
algorithm appears in Jones (1984).
ABSTRACT

The  design  of  a  software  package  to  help  a  user  perform 
spectral  analysis  is  described. 


1.  Introduction 

Spectral  analysis  is  widely  used  in  the  engineer¬ 
ing  and  physical  sciences,  but,  because  of  its  com¬ 
plexity,  there  are  many  pitfalls  to  its  successful 
application.  There  are  currently  a  number  of 
software  packages  that  can  do  the  numerical  com¬ 
putations  that  are  required  for  spectral  analysis, 
but  none  of  them  offer  extensive  guidance  for  the 
user.  Recent  developments  in  computer  science 
have  made  it  feasible  to  construct  intelligent 
software  in  the  form  of  expert  systems  that  mimic 
the  actions  of  a  human  expert  in  such  diverse  fields 
as  medicine,  geology,  and  computer  installation. 
Moreover,  Gale  and  Pregibon[3]  have  made  a  first 
attempt  at  constructing  an  expert  system  for  sta¬ 
tistical  analysis,  namely,  the  RCT  system  for  regres¬ 
sion  analysis. 

Because  of  these  developments  and  the  recent 
availability  of  powerful  computer  workstations  with 
high  resolution  graphics,  we  are  developing  a 
software  package  on  such  a  workstation  to  help 
scientists  perform  spectral  analysis.  The  research 
questions  that  our  project  addresses  are:  1)  what  is 
a  good  way  to  incorporate  intelligence  into  a 
software  package?  2)  what  help  can  a  software 
package  provide  a  user  for  organizing  the  results  of 
a spectral analysis? 3) is it possible to develop a
systematic strategy for spectral analysis such that,
given a time series that may be regarded as a realization
of a stationary process and given some or no
a priori knowledge on the shape of its underlying
spectrum, no important features of the data are
missed? and 4) what new tools for spectral analysis
are possible on a state-of-the-art workstation? In
this  report  we  concentrate  on  the  first  two  of  these 
questions. 

2.  Desired  Features  for  an  Ideal  Software  Package 

What  exactly  do  we  feel  is  lacking  in  available 
software  for  doing  spectral  analysis?  For  heavy 
users  of  interactive  statistical  packages  such  as  S 
and  ISP,  one  deficiency  is  a  lack  of  a  data  base 
management  system.  In  the  course  of  a  spectral 


analysis, a user can produce a large number of new
auxiliary data sets that are formed by manipulating
the original time series. (In a recent analysis of
some wind speed data, one user produced over 50
auxiliary data sets.) Keeping track of all these new
data sets is a real problem. It is a common experience
amongst analysts to be unable to recall with
the passage of time where all the auxiliary data sets
came from. An ideal software package would provide
some way to organize these data sets automatically.

A  second  desirable  feature  is  more  extensive 
graphical  capabilities  than  current  software  pack¬ 
ages  generally  provide.  The  availability  of  worksta¬ 
tions  with  enough  power  to  quickly  update  a  graphi¬ 
cal  display  (so-called  real-time  graphics)  opens  up  a 
whole  new  category  of  displays  that  a  user  would 
like  to  have  available. 

A  third  area  in  which  software  can  aid  a  user  is 
to  provide  help  in  the  specification  of  parameters 
for  sophisticated  methods  such  as  robust  fitting  of 
autoregressive  models.  Here  the  statistical  metho¬ 
dology  has  become  so  complex  that  even  the 
designers  of  the  methods  have  difficulty  in  applying 
them  without  constantly  referring  to  their  own 
technical  reports. 

For  inexperienced  users,  the  main  problem  with 
current  software  is  the  lack  of  in-depth  help.  An 
ideal  software  package  should  do,  guide,  explain, 
and  even  teach  the  techniques  of  good  spectral 
analysis.  Loosely  speaking,  augmenting  software  to 
provide  such  help  is  called  making  the  software 
more  "intelligent". 

3.  An  Example  of  Spectral  Analysis 

In  order  to  incorporate  intelligence  into  spec¬ 
tral  analysis  software,  it  is  helpful  to  develop  a 
model  of  how  a  human  expert  does  spectral 
analysis.  To  focus  our  discussion  below,  let  us 
quickly  step  through  an  example  of  a  spectral 
analysis  (the  reader  is  referred  to  Priestley[8]  and 
Bloomfield[l]  for  a  complete  discussion  of  the  sta¬ 
tistical  theory  used  here).  The  time  series  for  our 
example is monthly average values of the daily
water flow of the Willamette River at Salem, Oregon.
We begin by examining a plot of the data versus
time (figure 1a). We note immediately the marked
cyclical  behavior  of  the  data.  There  is,  however,  a 
problem  with  regarding  this  series  as  a  realization 
of  a  stationary  process,  namely,  there  is  much  less 
variability  in  the  series  at  the  low  points  of  each 
cycle  than  at  the  high  points. 


Since  the  data  are  all  positive,  we  might  con¬ 
sider  looking  at  the  logarithm  of  this  data  in  an 
attempt  to  stabilize  its  variance  over  time.  (For 
some  purposes  for  which  spectral  analysis  is  used, 
such  a  transformation  would  not  be  desirable  even 
if  it  did  stabilize  the  variance;  we  assume  that  this 
is not the case here.) This transformation is shown
in figure 1b. We see that the variability of this
series is much more uniform.


(a) plot of Willamette River data


Figure 1: Harmonic Analysis of River Flow Data, I.


Since the sampling time is one month, figure 1b
shows that the period of the phenomenon is about
one year (as one would suspect from physical considerations).
This plot suggests that this time
series may be modeled by a harmonic process of
the form

$$X_t = \mu_X + \sum_{k=1}^{K}\left[A_k\cos(\omega_k t) + B_k\sin(\omega_k t)\right] + \epsilon_t, \tag{1}$$

where $\mu_X$, $K$, $\{A_k\}$, $\{B_k\}$ and $\{\omega_k\}$ are unknown constants
and $\{\epsilon_t\}$ is a zero mean stationary process
with variance $\sigma_\epsilon^2$ and spectral density function $h_\epsilon(\cdot)$.
If $\{\epsilon_t\}$ were a white noise process, the spectrum for
$\{X_t\}$ would be completely determined by
$\{A_k\}$, $\{B_k\}$, $\{\omega_k\}$ and $\sigma_\epsilon^2$.

Our first task is to estimate $K$, the number of
sinusoids with distinct frequencies in the model,
and the corresponding $\omega_k$'s. The standard way to do
this is to look for peaks in the periodogram of
$\{X_t - \bar{X}\}$, where $\bar{X}$ is the sample mean. Figure 1c
shows that there is one prominent peak in the
periodogram near the angular frequency with a
period of one year ($\pi/6 \approx .16667\pi$ radians per
month, indicated by the dashed vertical line). This
peak is 10 dB above all other peaks, so we should
include a term in our model to account for it (if
there were any doubt as to the significance of the
peak, we could appeal to a formal statistical test
such as Fisher's g or Siegel's test[7]).
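The peak search described above can be sketched in a few lines. This is an illustration on synthetic data, not the Willamette River series and not the software the authors describe: it computes the periodogram of the mean-corrected series at the Fourier frequencies by a direct (slow) transform and ranks the ordinates. The function names are hypothetical.

```python
import cmath
import math


def periodogram(x):
    """Return (angular frequency, ordinate) pairs at the Fourier frequencies."""
    n = len(x)
    mean = sum(x) / n
    centered = [value - mean for value in x]  # remove the sample mean first
    result = []
    for j in range(1, n // 2 + 1):
        omega = 2.0 * math.pi * j / n
        dft = sum(value * cmath.exp(-1j * omega * t)
                  for t, value in enumerate(centered))
        result.append((omega, abs(dft) ** 2 / n))
    return result


def largest_peaks(pgram, count=2):
    """Frequencies of the largest ordinates, biggest first."""
    ranked = sorted(pgram, key=lambda pair: pair[1], reverse=True)
    return [omega for omega, _ in ranked[:count]]


def decibels(ratio):
    """Express a power ratio in decibels."""
    return 10.0 * math.log10(ratio)


# Ten years of synthetic monthly data: an annual cycle plus a weaker harmonic.
SERIES = [10.0 + 3.0 * math.cos(math.pi * t / 6.0)
          + math.cos(math.pi * t / 3.0 + 0.5) for t in range(120)]
```

For `SERIES`, `largest_peaks(periodogram(SERIES))` returns the annual frequency first and its first harmonic second, and comparing the two ordinates with `decibels` gives the kind of "so many dB above" statement used in the text.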

Besides the peak corresponding to an annual
period, there are numerous other bumps in the
periodogram that may or may not be due to other
sinusoidal components. If we assume that the
expected variation in the river flow is periodic with
a period of one year but is not necessarily
sinusoidal, we would expect to see peaks at frequencies
that are harmonics of $\pi/6$. These harmonics
are indicated in figure 1b by vertical dotted lines.
We see that the second largest peak in the periodogram
does occur at the first harmonic ($\pi/3$). There
are no other peaks that seem to be particularly
prominent. (Again Siegel's test can help us judge
the significance of questionable peaks.)

To  see  if  we  can  identify  some  components  that 
may  be  hidden  due  to  leakage  from  the  dominant 
peaks,  figure  2a  shows  the  periodogram  for  the  data 
after  it  has  been  tapered  with  a  100%  cosine  taper. 
Again  there  are  lots  of  bumps  besides  the  dominant 
two  we  have  already  identified,  none  of  which  seem 
to  be  particularly  prominent. 

Based  upon  our  examination  of  the  plots  in 
figure 1, let's assume a model given by equation (1)
with $K = 2$ and $\omega_k = k\pi/6$ for which $\{\epsilon_t\}$ is a white
noise process. This is a simple linear regression
model  which  we  can  fit  to  our  data  using  least 
squares.  Figures  2b  and  2c  show  the  residuals  from 
this  fitted  model  plotted  versus  time  and  offset 
from  the  beginning  of  a  year,  respectively.  To  con¬ 
tinue  the  analysis  of  this  data,  we  would  carefully 


study  these  residual  plots  to  judge  the  adequacy  of 
our  model. 
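A least squares fit of this kind can be sketched as follows. The example rests on a simplifying assumption that is not stated in the text: the series length is a whole number of years, so the sine and cosine regressors are mutually orthogonal and each least squares coefficient reduces to a simple projection. The function names and the test series are hypothetical.

```python
import math


def fit_harmonics(x, period=12, n_harmonics=2):
    """Least squares fit of a mean plus sinusoids at k * 2*pi/period.

    Assumes len(x) is a whole number of periods and that every fitted
    frequency lies below pi, so the regressors are orthogonal.
    """
    n = len(x)
    if n % period != 0:
        raise ValueError("series length must be a multiple of the period")
    if 2 * n_harmonics >= period:
        raise ValueError("too many harmonics for this period")
    mean = sum(x) / n
    coeffs = []
    for k in range(1, n_harmonics + 1):
        omega = 2.0 * math.pi * k / period
        a_k = 2.0 / n * sum(v * math.cos(omega * t) for t, v in enumerate(x))
        b_k = 2.0 / n * sum(v * math.sin(omega * t) for t, v in enumerate(x))
        coeffs.append((omega, a_k, b_k))
    return mean, coeffs


def residuals(x, mean, coeffs):
    """Observed minus fitted values, for plotting against time or month."""
    fitted = [mean + sum(a * math.cos(w * t) + b * math.sin(w * t)
                         for w, a, b in coeffs)
              for t in range(len(x))]
    return [v - f for v, f in zip(x, fitted)]
```

The residuals returned here are what one would plot against time and against month of the year when judging the adequacy of the model; for series that do not span whole years a general regression routine would be needed instead.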

There  are  two  comments  we  should  make  about 
this  analysis.  First,  the  actions  that  we  have  out¬ 
lined  are  not  a  literal  record  of  what  an  expert  did. 
Some  false  starts  and  "snooping  around"  have  been 
removed.  Second,  for  this  time  series,  if  our 
assumed  model  were  true,  we  would  have  only  one 
estimate  for  the  spectrum  (ignoring  minor  varia¬ 
tions  such  as  fitting  the  model  by  some  criterion 
other  than  least  squares).  For  time  series  that 
must  be  modelled  by  a  purely  continuous  stationary 
process  (i.e.,  the  spectrum  is  determined  by  a  spec¬ 
tral  density  function),  there  is  a  subjective  element 
introduced  by  the  choice  of  such  things  as  data 
tapers,  prewhitening  filters,  window  smoothing 
parameters,  and  order  of  autoregressive  models. 
These choices result in a wide variety of different
spectral  estimates.  Unless  we  have  some  external 
information  about  a  time  series,  there  is  no  way  of 
telling  which  estimate  is  closest  to  the  "truth." 
Moreover,  since,  to  quote  Tukey[8],  "...  most  spec¬ 
trum  analysis  is  exploratory  in  character,"  it  is 
often  not  the  goal  to  pick  one  of  these  estimates  as 
the  best  estimate,  but  rather  we  want  to  look  at 
many  different  spectral  estimates  to  try  to  under¬ 
stand  our  data  and  to  look  for  interesting  features 
in  it. 

4.  Prototype  Expert  System  for  Spectral  Analysis 

Our  first  attempt  to  incorporate  intelligence 
into  spectral  analysis  software  was  to  develop  a  pro¬ 
totype  expert  system.  We  built  the  system  using 
computer hardware and software available to us in
1984, namely, a VAX 750 with primitive graphics terminals
running under the 4.2 BSD UNIX operating
system with Franz LISP and OPS5, a programming
language for a production system. Such a system
requires that the knowledge of an expert be summarized
in production rules of the general form "if
A, B, ... are true, then assert action C." Our first
task  was  to  extract  the  knowledge  of  an  expert  in 
this  form. 

To  do  so,  we  followed  an  expert  through  the 
analysis  of  several  "typical"  time  series  such  as  the 
river  flow  data.  We  were  able  to  come  up  with  a 
"script"  that  represented  the  decisions  and  actions 
that  the  expert  took  at  each  stage  of  the  analysis. 
Each  portion  of  the  script  was  initially  coded  into 
production  rules.  As  an  example,  a  production  rule 
that  we  could  have  included  based  upon  the  river 
flow  analysis  is  "if  the  data  is  positive  and  if  the 
variability  of  the  series  is  proportional  to  the  height 
of  the  series,  then  make  a  log  transformation." 
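The general form "if A, B, ... are true, then assert action C" can be illustrated with a minimal sketch. This is not OPS5 syntax; the names `Rule`, `fire`, `all_positive`, and `spread_tracks_level` are hypothetical, and the facts are assumed to have been supplied already (for instance by the user or by a test statistic).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A production rule: if every condition holds, assert the action."""
    name: str
    conditions: List[Callable[[Dict[str, bool]], bool]]
    action: str


def fire(rules: List[Rule], facts: Dict[str, bool]) -> List[str]:
    """Return the actions of all rules whose conditions are all true."""
    return [r.action for r in rules if all(c(facts) for c in r.conditions)]


# The river flow rule quoted in the text, with hypothetical fact names.
log_rule = Rule(
    name="log-transform",
    conditions=[
        lambda f: f["all_positive"],         # "the data is positive"
        lambda f: f["spread_tracks_level"],  # variability proportional to height
    ],
    action="make a log transformation",
)

if __name__ == "__main__":
    facts = {"all_positive": True, "spread_tracks_level": True}
    print(fire([log_rule], facts))
```

A rule of this kind only states what to do once its conditions are known; it says nothing about the order in which the facts are gathered.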

We  learned  several  things  from  this  exercise. 
First,  it  is  difficult  to  capture  the  expertise  involved 
in  spectral  analysis  using  just  production  rules. 
Much  of  our  script  was  purely  procedural  in  nature, 
and  this  was  rather  clumsy  to  code  with  production
rules.  For  example,  in  the  river  flow  analysis,  once
we  had  noted  the  strong  cyclical  variation  in  the  log
of  the  original  data  (figure  1a),  there  was  a
procedure  that  we  followed:  we  identified  the
frequencies  of  the  sinusoidal  components  in  the  model
using  the  periodogram,  fitted  the  model  to  the  data,
and  examined  the  residuals.  We  found  it  easier  to
write  some  of  the  purely  procedural  parts  of  the
system  in  the  C  programming  language.

Figure  2:  Harmonic  Analysis  of  River  Flow  Data,  II.
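The three procedural steps just described (locate a frequency with the periodogram, fit the sinusoid, examine the residuals) might look roughly as follows. This is a simplified sketch, not the authors' C code: it assumes a single sinusoid at a Fourier frequency k/N with 0 < k < N/2, for which the least squares coefficients have the closed form used here, and the function names are hypothetical.

```python
import cmath
import math


def periodogram(x):
    """Periodogram of the demeaned series at Fourier frequencies k/N, k = 1..N//2."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    freqs, ords = [], []
    for k in range(1, n // 2 + 1):
        s = sum(d[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k / n)
        ords.append(abs(s) ** 2 / n)
    return freqs, ords


def fit_sinusoid(x, k):
    """Least squares fit of mean + A cos + B sin at Fourier frequency k/N.

    The closed form is exact only for 0 < k < N/2 (orthogonal regressors).
    """
    n = len(x)
    mean = sum(x) / n
    w = 2 * math.pi * k / n
    a = 2 / n * sum((x[t] - mean) * math.cos(w * t) for t in range(n))
    b = 2 / n * sum((x[t] - mean) * math.sin(w * t) for t in range(n))
    fitted = [mean + a * math.cos(w * t) + b * math.sin(w * t) for t in range(n)]
    residuals = [x[t] - fitted[t] for t in range(n)]
    return a, b, residuals


if __name__ == "__main__":
    n = 48
    series = [3 + 2 * math.cos(2 * math.pi * 4 * t / n) for t in range(n)]
    freqs, ords = periodogram(series)
    k_peak = ords.index(max(ords)) + 1   # step 1: identify the frequency
    a, b, res = fit_sinusoid(series, k_peak)  # step 2: fit the model
    print(k_peak, a, b, max(abs(r) for r in res))  # step 3: inspect residuals
```

A sequence like this is naturally written as straight-line code, which is why it sits awkwardly in a rule-based formalism.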

Second,  graphical  displays  play  a  critical  role  in 
spectral  analysis.  There  are  many  features  of  data 
that  are  difficult  to  extract  by  a  statistical  measure
but  that  are  readily  apparent  to  the  trained  eye.  To
obtain  this  visual  information  from  an  untrained 
user,  the  expert  system  was  programmed  to  carry 
out  a  dialog  between  itself  and  the  user.  It 
presented  a  series  of  graphs  to  the  user  and  queried 
him  or  her  about  the  presence  or  absence  of  certain 
features  in  the  graphs.  If  the  user  was  unable  to 
answer  the  system's  questions,  the  system  would 
attempt  either  to  help  the  user  by  supplying  exam¬ 
ples  or  to  answer  the  questions  by  itself  based  upon 
some  test  statistics.  This  approach  exploits  the 
superior  human  visual  ability  to  find  structure  in
graphs. 

Third,  rather  simple  automatic  mechanisms 
were  found  for  keeping  track  of  an  analysis  and  of 
the  auxiliary  data  sets  created  during  a  spectral 
analysis.  The  OPS5  code  and  C  procedural  routines
did  their  numerical  work  by  calling  task  programs.
The  collection  of  these  task  programs  is  by  itself  a
primitive  system  for  carrying  out  spectral  analysis. 
For  example,  suppose  the  values  of  the  log  of  the 
river  flow  series  reside  in  a  data  file  called  "Irf".  To
taper  this  series  with  a  100%  cosine  data  taper  and
calculate  a  periodogram  for  it  (as  was  done  in  figure
2a),  we  would  give  the  following  commands  to  the
UNIX  operating  system:

    taper -p 1.00 Irf Irf.tpr
    pgram Irf.tpr Irf.tpr.pgm

The  tapered  time  series  and  its  periodogram  would
now  be  in  the  auxiliary  files  "Irf.tpr"  and
"Irf.tpr.pgm",  respectively.  (The  names  of  these  two
files  can  be  arbitrarily  chosen.)  Part  of  the  action  of
both  commands  is  to  place  a  copy  of  the  commands
themselves  at  the  end  of  a  special  file  named
"hist.Isa".  A  listing  of  this  file  at  the  end  of  an  analysis
gives  a  complete  history  of  all  commands  that  were
executed  during  the  course  of  an  analysis.
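For readers unfamiliar with cosine data tapers, the following sketch shows one common variant (a split cosine bell). It is an assumption that the "taper" task program computed something of this form; the function name and the exact weight formula are illustrative only. The proportion p is the fraction of the series that is tapered, half at each end, so p = 1.00 tapers the whole series.

```python
import math


def cosine_taper(n, p):
    """Split cosine bell weights for a series of length n.

    p is the total proportion of the series tapered (p/2 at each end);
    p = 0 leaves the series untouched, p = 1 tapers every point.
    """
    m = int(p * n / 2)          # number of points tapered at each end
    w = [1.0] * n
    for t in range(m):
        v = 0.5 * (1 - math.cos(math.pi * (t + 0.5) / m))
        w[t] = v                # rising half of the bell
        w[n - 1 - t] = v        # mirrored falling half
    return w


def apply_taper(x, p):
    """Demean the series, then multiply by the taper weights."""
    mean = sum(x) / len(x)
    w = cosine_taper(len(x), p)
    return [wi * (xi - mean) for wi, xi in zip(w, x)]


if __name__ == "__main__":
    print([round(v, 3) for v in cosine_taper(8, 1.0)])
```

The periodogram of the tapered series is then computed in the usual way from the output of `apply_taper`.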

In  addition,  the  formats  of  "Irf.tpr"  and
"Irf.tpr.pgm"  are  special  in  that  they  contain  not
only  data  values  but  also  a  copy  of  the  UNIX
command  that  created  them.  A  special  task  program
called  "genesis"  could  then  be  invoked  at  any  later
date  to  find  out  how  these  two  auxiliary  files  were
created.  Thus  the  command

    genesis Irf.tpr Irf.tpr.pgm

would  yield  the  output

    Irf.tpr:  taper -p 1.00 Irf Irf.tpr
    Irf.tpr.pgm:  pgram Irf.tpr Irf.tpr.pgm

This  simple  automatic  mechanism  has  proven  quite 
useful  for  keeping  track  of  auxiliary  data  sets  and 
could  form  the  basis  of  a  more  elaborate  data  base 
management  system.  (A  report  that  describes  this 
software  system  in  detail  is  available  upon  request.) 
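The bookkeeping idea (each output file carries the command that created it, and every command is appended to a history file) can be sketched as follows. The actual file formats of the task programs are not described here, so the header line, the function names, and the history file handling below are all assumptions made for illustration.

```python
import os
import tempfile

PREFIX = "# created by: "   # hypothetical header marker


def write_with_provenance(path, values, command, history):
    """Write data values preceded by the creating command; log the command."""
    with open(path, "w") as f:
        f.write(PREFIX + command + "\n")
        for v in values:
            f.write(repr(v) + "\n")
    with open(history, "a") as h:   # append, so the history accumulates
        h.write(command + "\n")


def genesis(path):
    """Return the command stored in a file's header, or None if absent."""
    with open(path) as f:
        first = f.readline().rstrip("\n")
    return first[len(PREFIX):] if first.startswith(PREFIX) else None


def read_values(path):
    """Read the data values back, skipping the provenance header."""
    with open(path) as f:
        return [float(line) for line in f if not line.startswith("#")]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "series.tpr")
        hist = os.path.join(d, "history.txt")
        write_with_provenance(out, [0.1, 0.9], "taper -p 1.00 series series.tpr", hist)
        print(os.path.basename(out) + ":", genesis(out))
```

Because the provenance travels inside each file, it survives renaming or copying of the file, which a separate log alone would not guarantee.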

The  final  lesson  that  we  learned  is  that  our 
approach  was  painfully  inadequate.  The  chief  com¬ 
plaint  from  those  who  observed  the  system  in  action 
was  that  it  was  too  rigid  and  did  not  allow  the  user 
to  "snoop  around"  easily  when  interesting  features 
of  the  data  were  displayed  by  the  system:  the  script
became  a  straitjacket  that  forced  the  user  to
follow  a  certain  course  of  actions.  In  effect,  our  script
modelled  only  what  the  expert  did  on  the  majority
of  occasions  and  failed  to  capture  what  was  done
when  some  unexpected  feature  of  the  time  series  was
revealed.  Our  system  is  unfortunately  just  another
example  of  a  "feeble  prototype"  (to  use  the  words  of 
Tukey[0]  in  describing  efforts  to  date  in  creating 
expert  systems  for  statistics). 

We  believe  that  a  useful  expert  system  can  be 
built  for  spectral  analysis  but  not  with  an
off-the-shelf  production  system  such  as  OPS5.  The
problems  that  must  be  overcome  are  the  following.
First,  a  better  way  must  be  found  to  extract  infor¬ 
mation  from  graphs.  This  is  critical  since  so  much 
of  the  information  that  an  analyst  uses  comes  from 
graphs.  For  example,  one  possible  solution  to  the 
straitjacket  problem  is  to  enrich  the  expert
system  by  including  many  more  rules  to  represent  all
possible  conclusions  that  an  expert  could  draw  from 
a  graph.  Under  our  current  approach,  this  would 
mean  that  the  expert  system  would  have  to  guide 
the  user  through  an  exhaustive  list  of  questions 
about  the  presence  or  absence  of  certain  features. 
This  is  not  feasible  since  such  a  scheme  would 
quickly  exhaust  the  patience  of  the  user. 

Second,  some  mechanism  has  to  be  incor¬ 
porated  in  the  system  to  allow  it  to  "forget"  certain 
"facts"  that  it  has  learned  and  all  conclusions  that  it 
has  deduced  from  these  "facts."  (This  problem  is 
called  "truth  maintenance"  in  the  expert  system 
literature.)  This  is  probably  the  chief  difference 
between  statistical  analysis  and  medical  diagnosis 
for  which  production  systems  have  been  successful. 
In  the  latter  discipline  tests  are  performed  on  a 
patient,  and  from  their  results  conclusions  are 
drawn.  The  results  of  the  tests  themselves  are 
never  really  questioned.  In  statistical  analysis,
certain  hypotheses  are  assumed  to  be  true  until  it
becomes  obvious  that  they  are  wrong.  To  cite  the
river  flow  data  as  an  example,  if  we  hadn't  noticed
the  relationship  between  variability  and  value  of  the
series  in  figure  1a,  we  might  have  carried  out  a
harmonic  analysis  on  the  original  data.  When  we  got  to
the  point  of  plotting  the  residuals,  we  hopefully
would  have  noticed  a  cyclical  variability  in  the
residuals  that  would  have  led  us  back  to  concentrate
on  figure  1a.  (To  quote  Chambers[2],  "...  data
analysis  is  a  more  heterogeneous,  quantitative  and
iterative  process  than  ...  medical  diagnosis  ...  .")
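The "forgetting" requirement can be made concrete with a small sketch: each deduced conclusion records the facts it was derived from, so that retracting a fact also removes everything that depended on it. This is a deliberately simplified stand-in for truth maintenance (each conclusion has a single justification), and the class and method names are hypothetical.

```python
class FactBase:
    """Facts plus the facts each one was deduced from (empty set = assumed)."""

    def __init__(self):
        self.support = {}

    def assume(self, fact):
        self.support[fact] = set()

    def deduce(self, fact, from_facts):
        missing = [f for f in from_facts if f not in self.support]
        if missing:
            raise KeyError("unknown facts: %r" % missing)
        self.support[fact] = set(from_facts)

    def retract(self, fact):
        """Remove a fact and, transitively, every conclusion resting on it."""
        removed = set()
        stack = [fact]
        while stack:
            f = stack.pop()
            if f not in self.support or f in removed:
                continue
            removed.add(f)
            # Queue every conclusion that lists f among its supports.
            stack.extend(g for g, deps in self.support.items() if f in deps)
        for f in removed:
            del self.support[f]
        return removed


if __name__ == "__main__":
    kb = FactBase()
    kb.assume("variability is constant")
    kb.assume("strong periodic variation")
    kb.deduce("no transformation needed", ["variability is constant"])
    kb.deduce("do harmonic analysis on original data",
              ["no transformation needed", "strong periodic variation"])
    # The residual plot later contradicts the first assumption.
    print(sorted(kb.retract("variability is constant")))
    print(sorted(kb.support))
```

In the river flow scenario, retracting the constant-variability assumption discards the decision to analyze the untransformed data while keeping the independent observation of periodic variation.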

Finally,  creating  an  expert  system  that  is
primarily  for  non-experts  vastly  limits  the  number  of
potential  users  of  the  system.  Experts  are  not 
interested  in  using  it  because  they  want  to  ignore 
all  of  the  "help"  facilities.  Non-experts  may  find 
them  initially  useful,  but,  after  several  runs  through 
such  a  system,  they  will  rapidly  acquire  the  exper¬ 
tise  built  into  the  system  and  will  become  bored 
with  using  it. 


5.  Display  Oriented  System  for  Spectral  Analysis 

In  January  of  1985,  we  received  four
state-of-the-art  LISP  machines  for  use  in  our  project
through  a  grant  from  the  Department  of  Defense 
University  Research  Instrumentation  Program  with 
matching  funds  from  the  University  of  Washington. 
The  availability  of  these  machines  and  the  experi¬ 
ence  we  obtained  in  designing  our  prototype  expert 
system  caused  us  to  design  a  new  system  from 
scratch.  Our  new  approach  is  to  produce  a  system 
for  spectral  analysis  that  is  useful  for  experts  in 
such  a  way  that  it  can  be  augmented  with  various
"help"  facilities  for  less  experienced  users.

In  order  to  produce  a  system  that  is  useful  to 
experts,  we  need  to  have  a  model  of  how  experts  do 
spectral  analysis.  Since  following  a  script  is  obvi¬ 
ously  not  what  an  expert  does,  we  have  attempted 
to  come  up  with  a  more  reasonable  model.  Our  new 
model  is  a  rather  simple  one,  namely,  that  an 
expert  does  spectral  analysis  by  carefully
examining  a  sequence  of  graphics  displays.  At  each  stage
of  the  analysis  the  features  that  the  expert 
observes  in  a  display  prompt  him  or  her  to  look  at 
another  display  to  learn  something  more  about  the 
time  series. 

With  this  model  for  spectral  analysis,  a  rather 
simple  design  for  more  intelligent  software  is  possi¬ 
ble.  Our  first  task  is  to  create  a  set  of  independent 
graphics  displays  that  an  expert  finds  useful.  The 
expert  can  make  use  of  such  a  display  as  is,  but  the
less  sophisticated  user  can  obtain  help  by
requesting  a  list  of  features  that  he  or  she  should  be
looking  for.  Alternatively,  the  user  could  go  through  an
interactive  "miniscript"  that  refers  to  only  the  one
display  at  hand  and  that  is  designed  to  force  him  or
her  to  note  as  much  about  the  time  series  as
possible  from  that  display.  Anything  that  the  user  learns
about  the  time  series  from  such  a  miniscript  can  be
stored  in  a  data  object  that  represents  the  time
series.  (For  our  purposes  we  can  define  a  data
object  for  a  time  series  as  a  computer
representation  of  both  the  values  of  the  time  series
and  all  other  information  that  is  known  or  has  been
deduced  about  the  series.)
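A data object in this sense could be represented along the following lines. The system described here was to be built on LISP machines; this Python sketch only illustrates the idea, and the field and method names (`features`, `comments`, `derived_from`, `transform`) are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DataObject:
    """Values of a time series plus everything known or deduced about it."""
    name: str
    values: List[float]
    features: Dict[str, bool] = field(default_factory=dict)  # noted via displays
    comments: List[str] = field(default_factory=list)        # free-form user notes
    derived_from: Optional[str] = None                       # name of parent object

    def note_feature(self, feature: str, present: bool) -> None:
        """Record the presence or absence of a feature seen in a display."""
        self.features[feature] = present

    def transform(self, name: str, func: Callable[[float], float]) -> "DataObject":
        """Create a new data object from transformed values."""
        return DataObject(name=name,
                          values=[func(v) for v in self.values],
                          derived_from=self.name)


if __name__ == "__main__":
    river = DataObject("river flow data", [10.0, 100.0, 1000.0])
    logged = river.transform("log of river flow data", math.log10)
    logged.note_feature("strong periodic variation", True)
    print(logged)
```

Keeping the noted features with the values means that any later display, or any help facility, can consult what has already been learned about the series.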

To  clarify  these  ideas,  let  us  look  at  a  mock-up 
of  one  display  in  our  proposed  system  (figure  3). 
Each  display  consists  of  one  or  more  graphics  win¬ 
dows  and  four  "mouse"  sensitive  windows  to  control 
what  is  visible  in  the  graphics  windows  and  to  allow 
the  user  to  advance  to  other  displays.  The  mock-up 
shows  the  periodogram  display  as  it  would  be 
applied  to  the  data  object  that  contains  the  log  of 
the  river  data.  For  this  display  there  is  only  one 
graphics  window.  It  shows  the  values  of  the  periodo¬ 
gram  for  the  time  series  versus  frequency. 

The  "goodies"  window  allows  the  user  to  do 
several  things:  to  reset  parameters  that  control 
exactly  how  the  periodogram  is  calculated  and
plotted;  to  augment  the  basic  plot;  to  perform  some
statistical  tests  that  are  associated  with  the
periodogram;  to  manipulate  the  data  object  under
study;  and  to  create  a  new  data  object  from  the 
values  shown  in  the  plot.  In  the  mock-up,  the  first 
five  items  in  this  window  show  the  user  in  bold 
letters  the  current  values  of  the  settable  parame¬ 
ters.  Thus  the  periodogram  was  calculated  from  a 
demeaned  time  series  and  by  applying  a  cosine  data 
taper  to  20%  of  the  time  series.  It  was  then 
evaluated  on  a  finer  grid  of  frequencies  than  the 
standard  frequencies.  The  results  of  these  compu¬ 
tations  were  plotted  on  a  decibel  versus  linear 
scale.  All  of  the  settable  parameters  can  be 
changed  by  moving  a  "mouse"  controlled  pointer  to 
the  appropriate  place  and  by  either  clicking  a
button  on  the  "mouse"  (to,  say,  select  a  linear  "y"
scale)  or  by  clicking  and  entering  a  value  from  the 
keyboard  (to  change  the  proportion  of  data  tapered 
from  20%  to  some  other  value).  As  soon  as  a  param¬ 
eter  is  reset,  the  plot  in  the  graphics  window  is 
automatically  updated. 

Three  augmentations  to  the  plot  are  possible  in 
this  version  of  the  periodogram  display.  One  or 
more  user-specified  fundamental  frequencies  can 
be  indicated  on  the  plot  by  vertical  dashed  lines, 
and  any  number  of  associated  harmonics  can  be 
shown  by  vertical  dotted  lines.  In  the  mock-up  a 
fundamental  frequency  corresponding  to  a  period  of 
one  year  and  its  first  five  harmonics  are  shown.  The 
third  augmentation  allows  the  user  to  plot  one  or 
more  copies  of  the  kernel  associated  with  the  data 
taper.  This  option  allows  the  user  to  identify  peaks 
in  the  periodogram  that  are  due  solely  to  window 
leakage. 

A  list  of  all  data  objects  in  the  current  analysis 
is  given  in  the  data  objects  window.  The  first  data 
objects  in  the  list  are  those  that  are  being  examined 
in  the  current  display  and  are  marked  "active".  For 
the  periodogram  display  there  is  only  one,  namely, 
the  data  object  that  contains  the  log  of  the  river 
flow  data.  The  user  can  manipulate  these  data 
objects  and  create  new  ones  by  selecting  (by  means 
of  the  "mouse")  one  of  the  final  three  items  in  the 
"goodies"  window.  The  item  "make  new  data  object" 
allows  the  user  to  create  a  new  data  object  from  the 
values  plotted  in  the  graphics  window.  The  "add 
comments"  item  lets  the  user  add  any  comments 
desired  to  any  of  the  data  objects  in  the  current 
analysis.  Finally,  the  "examine  data  object"  item  is 
used  to  look  at  all  the  auxiliary  information  that  has 
been  stored  along  with  the  actual  data  values. 

Included  with  each  graphics  display  is  a  direc¬ 
tory  of  all  other  displays.  In  the  mock-up,  after  the 
user  is  finished  looking  at  the  periodogram  display, 
he  or  she  may  select  one  of  six  graphics  displays  to 
see  next  and  may  optionally  choose  any  of  the  listed
data  objects  to  serve  as  the  input  to  that  display  if 
he  or  she  does  not  want  to  use  the  default  "active" 
data  object. 


Goodies

•  demean  data:  yes  no
•  "x"  units:  db  linear
•  "y"  units:  db  linear
•  standard  frequencies  only:  yes  no
•  data  taper:  cosine;  dpss
   proportion  of  data  tapered:  20%
•  show  fundamental  frequency
•  show  harmonics:  5
•  show  kernel
•  do  Fisher's  test
•  do  Siegel's  test
•  make  new  data  object
•  add  comments  to  data  object
•  examine  data  object

Directory  of  Displays

•  harmonic  regression
•  wrap  around  plot
•  white  noise  test
•  time  series  plot
•  make  transformation  (log,  etc.)
•  periodogram

Data  Objects

•  log  of  river  flow  data  (active)
•  river  flow  data
•  square  root  of  river  flow  data

Help

•  What  should  I  be  looking  for  in  this  display?
•  What  do  the  goodies  do?
•  Why  should  I  look  at  other  displays?

Figure  3:  Mockup  of  Periodogram  Display


For  the  less  sophisticated  user,  the  help  window
offers  three  types  of  guidance.  The  first  help  item 
gives  the  user  a  list  of  features  (and  examples  if  so 
desired)  that  he  or  she  should  be  looking  for  in  the 
current  graphics  display.  The  system  queries  the 
user  concerning  the  presence  or  absence  of  each 
feature  and  stores  the  results  of  this  interaction  in 
the  "active"  data  object.  The  second  help  item
explains  in  detail  (with  examples  if  necessary)  what 
each  of  the  items  in  the  "goodies"  window  does.  The 
third  item  in  the  help  window  tells  the  user  why  he 
or  she  might  want  to  look  at  other  displays.  Based 
upon  what  display  the  user  is  currently  looking  at 
and  what  information  is  known  about  the  time
series  (as  stored  in  its  corresponding  data  object),
the  system  will  order  the  items  in  the  directory  of 
displays  to  reflect  what  it  thinks  would  be  the  most 
informative  displays  to  look  at  next. 

Each  graphics  display  has  a  small  set  of  produc¬ 
tion  rules  that  allows  the  system  to  order  the  direc¬ 
tory  of  displays  and  explain  to  the  user  the 
rationale  for  the  order.  For  example,  the  fact  that 
the  harmonic  regression  display  is  listed  first  in  the 
directory  in  the  mock-up  may  be  due  to  some 
knowledge  supplied  by  the  user  from  one  of  two 
sources:  a  previously  examined  display  such  as  the 
time  series  plot  display  (where  the  user  might  have 
noted  "strong  periodic  variation");  or  the  feature
extraction  question-and-answer  session  in  the
current  display  (where  the  user  might  have  noted 
that  the  periodogram  has  "one  or  more  prominent 
narrow  peaks"). 
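The way a small set of rules could both order the directory of displays and explain the order can be sketched as follows. The two features and the display names come from the text and the mock-up; the third rule, the explanations, and the scoring by number of matching rules are assumptions made purely for illustration.

```python
# Each rule: (feature noted in a data object, display it suggests, rationale).
RULES = [
    ("strong periodic variation", "harmonic regression",
     "periodic variation suggests fitting sinusoids"),
    ("one or more prominent narrow peaks", "harmonic regression",
     "narrow periodogram peaks suggest sinusoidal components"),
    ("no prominent peaks", "white noise test",          # hypothetical rule
     "a flat periodogram suggests testing for white noise"),
]

DISPLAYS = ["periodogram", "time series plot", "wrap around plot",
            "white noise test", "harmonic regression",
            "make transformation (log, etc.)"]


def order_displays(displays, features, rules=RULES):
    """Rank displays by the number of rules recommending them.

    Returns the ranked list and, per display, the reasons that fired.
    Ties keep their original order because Python's sort is stable.
    """
    reasons = {d: [] for d in displays}
    for feature, display, reason in rules:
        if features.get(feature) and display in reasons:
            reasons[display].append(reason)
    ranked = sorted(displays, key=lambda d: -len(reasons[d]))
    return ranked, reasons


if __name__ == "__main__":
    noted = {"one or more prominent narrow peaks": True}
    ranked, why = order_displays(DISPLAYS, noted)
    print(ranked[0], "-", "; ".join(why[ranked[0]]))
```

Because the rules attached to a display refer only to features stored in the data object, each display's rule set can be written without knowledge of the others.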

The  system  that  we  are  designing  around  this 
display-oriented  model  of  spectral  analysis  cannot 
be  called  an  expert  system  since  it  only  provides 
local  (i.e.,  from  one  display  to  the  next)  guidance 
and  not  global  guidance.  Its  chief  advantages  are: 
modularity  of  design  (each  display  is  independent  of 
all  other  displays);  help  to  the  user  is  added  in  a 
well-defined  way  after  each  display  has  been 
designed;  and  the  help  facilities  are  non-intrusive 
and  can  be  completely  ignored.  We  also  feel  that 
our  design  helps  alleviate  the  well-known  knowledge 
transfer  bottleneck  common  to  expert  systems 
since  here  the  expert  need  only  answer  a  few
well-defined  questions  to  make  the  system  "intelligent"
("What  do  you  hope  to  learn  by  looking  at  that
graph?",  "What  other  graphs  would  help  you  clarify
questions  raised  by  this  graph?",  etc.). 

6.  Future  Directions

We  are  currently  implementing  the  spectral 
analysis  system  described  in  the  previous  section. 
After  the  rudiments  of  the  system  are  in  place  and 
a  prototype  of  the  system  has  been  critiqued,  we 
plan  to  incorporate  as  many  graphics  displays  as 
time,  resources,  and  interest  allow.  We  also  plan  to 
augment  the  system  by  exploring  the  following 
research  topics. 

6.1  Classification  of  Time  Series

We  recognize  that  there  are  many  users  who 
require  more  global  help  than  our  proposed  system 
can  give  them.  One  possibility  to  provide  such  help 
is  suggested  by  Schank’s  cognitive  model  approach 
to  AI  problems,  in  which  he  defines  understanding 
as  the  ability  to  relate  the  problem  at  hand  to  one's 
past  experience.  Gersch[4]  has  recently  published 
some  results  on  nearest  neighbor  rule  classification 
of  time  series.  His  idea  is  to  have  a  data  base  of 
time  series  and  a  measure  of  dissimilarity  between 
time  series  (he  used  the  Kullback-Leibler  informa¬ 
tion  number).  Any  new  time  series  is  then  classified 
by  comparing  it  to  each  of  the  time  series  in  the 
data  base.  The  nearest  neighbor  to  the  new  time 
series  is  defined  as  that  time  series  in  the  data  base 
which  is  least  dissimilar. 

These  ideas  can  be  used  to  produce  global  help
for  a  user.  The  first  step  is  to  have  an  expert  do  a 
spectral  analysis  on  a  large  number  of  different 
time  series.  For  each  time  series,  the  expert  will 
use  some  combination  and  ordering  of  graphics 
displays  and  will  create  a  certain  collection  of  data 
objects.  When  an  inexperienced  user  comes  in  with 
a  new  time  series,  it  is  classified  using  Gersch's 
scheme,  and  the  user  is  told  to  follow  the  actions 
the  expert  took  in  analyzing  the  time  series  in  the
data  base  that  is  least  dissimilar.  (If  there  are
several  time  series  in  the  data  base  that  are  close  in 
dissimilarity,  the  user  could  select  visually  that  one 
series  that  he  or  she  feels  to  be  closest  to  the  new 
series.) 
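The nearest neighbor idea can be sketched as follows. Gersch's time domain measure is not reproduced here; as a stand-in, this sketch uses a symmetrized Kullback-Leibler discrepancy between normalized spectral estimates (a frequency domain assumption of ours), and the data base layout and all names are hypothetical.

```python
import math


def normalized(spec):
    """Scale strictly positive spectral ordinates so that they sum to one."""
    total = sum(spec)
    return [s / total for s in spec]


def dissimilarity(p, q):
    """Symmetrized Kullback-Leibler discrepancy, KL(p||q) + KL(q||p)."""
    p, q = normalized(p), normalized(q)
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))


def nearest_neighbor(new_spec, database):
    """Name of the data base entry whose spectrum is least dissimilar."""
    return min(database,
               key=lambda name: dissimilarity(new_spec, database[name]["spectrum"]))


if __name__ == "__main__":
    # Toy data base: each entry keeps a spectrum and the expert's actions.
    database = {
        "river flow": {"spectrum": [1.0, 10.0, 1.0, 1.0],
                       "actions": ["log transform", "periodogram",
                                   "harmonic regression"]},
        "white noise": {"spectrum": [1.0, 1.0, 1.0, 1.0],
                        "actions": ["periodogram", "white noise test"]},
    }
    match = nearest_neighbor([1.0, 8.0, 1.2, 0.9], database)
    print(match, "->", database[match]["actions"])
```

Storing the expert's sequence of displays alongside each reference series is what turns the classification into guidance: the match determines which recorded analysis the user is asked to follow.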

What  we  need  to  investigate  is  1)  whether 
Gersch's  classification  scheme  is  adequate  and,  if 
not,  whether  we  can  come  up  with  one  that  is
(Gersch's  scheme  is  a  time  domain  one;  there  is  a
corresponding  frequency  domain  one  that  we  plan
to  explore);  2)  what  is  the  most  effective  way  of
telling  the  user  to  follow  a  set  of  actions  in  our  system;
and  3)  how  we  can  automatically  update  the  data
base  of  time  series  (this  will  involve  some  issues  in
machine  learning).

6.2  Automated  Creation  of  Graphics  Displays 

One  of  the  nice  features  of  the  S  and  ISP 
interactive  statistical  packages  is  the  ease  with 
which  a  user  can  expand  the  system  by  adding  new 
functions  of  his  or  her  own  creation.  If  our  system 
is  to  be  widely  used,  we  need  to  develop  some  way 
for  the  user  to  add  new  graphics  displays.  One  of  us 
(Kerr)  will  be  exploring  this  problem  of  a  "program 
writing"  program  in  a  complex  system. 

6.3  New  Data  Analysis  Tools 

In  a  future  report[5]  we  will  give  some  answers 
to  the  fourth  question  of  the  introduction,  namely, 
"what  new  tools  are  available  for  spectral  analysis 
on  a  state-of-the-art  workstation?".  We  have  several 
promising  ideas  to  exploit  the  unique  graphical 
capabilities  of  these  machines.


ARTIFICIAL  INTELLIGENCE  AND  STATISTICS:
DO  WE  HAVE  THE  CART  BEFORE  THE  HORSE? 


William  F.  Eddy 

Department  of  Statistics 
Carnegie-Mellon  University 

The  last  two  decades  have  seen  a  growing  interest  in  production  systems,  or 
rule-based  expert  systems.  Originally,  production  rules  were  statements  of  the  form 
"if  A  then  B"  and  reasoning  in  these  systems  was  simple  (albeit  tedious)  and  exact. 
Recently,  a  number  of  rule-based  expert  systems  have  been  used  on  inexact  reasoning 
(that  is,  on  uncertain  knowledge).  This  talk  will  provide  a  comparative  review  of 
some  of  the  best-known  methods  of  inference  used  in  expert  systems  and  will  argue 
that  most  of  these  methods  are  hopeless  as  models  of  human  reasoning. 


BAYESIAN  IMAGE  RESTORATION 

Stuart  Geman

We  develop  a  class  of  probability  image  models  that  accommodate  smoothness,
edges,  textures,  and  other,  "higher  level",  image  attributes.  These  are  Markov 
Random  Fields  with  a  three  dimensional  graph  structure.  The  "bottom"  level  of  the 
graph  is  the  pixel  process,  corresponding  to  the  actual  digitized  image.  Successively 
higher  levels  correspond  to  increasingly  complex  attributes,  including  locations
and  orientations  of  edges,  line  segments,  and  polygonal  regions.  The  constructed 
distribution  is  employed  as  a  prior  distribution  on  images.  Given  a  degraded 
picture,  we  seek  the  image  that  maximizes  the  posterior  distribution  (the  so-called 
MAP  estimator).  Maximization  is  performed  by  a  highly  parallel  computational 
technique  called  stochastic  relaxation. 

We  will  present  the  results  of  experiments  with  some  simple  pictures.  These 
demonstrate:  (1)  parameter  estimation  for  the  prior;  and  (2)  blur  and  noise
removal,  segmentation,  and  boundary-finding  at  extremely  low  signal  to  noise  ratios.


Knowledge  Representation  for  Expert  Data  Analysis  Systems


Ronald  A.  Thisted 

Department  of  Statistics 
The  University  of  Chicago 
Chicago,  Illinois  60637 

An  expert  system  is  a  computer  program  which  performs  a  task  at  the  level  of  performance  of  a  human 
expert  with  some  years  of  experience  at  the  task.  In  this  paper  we  examine  what  it  would  mean  for  a 
computer  program  to  be  an  expert  system  for  data  analysis,  why  there  is  some  hope  that  such  a  system 
could  be  developed,  and  what  makes  an  expert  system  different  from  other  sorts  of  statistical  software
with  which  statisticians  are  familiar.  Standard  programs  implement  algorithms  for  computations  on 
data,  which  in  turn  are  represented  using  data  structures.  The  choice  of  a  suitable  data  structure  often 
determines  the  form  an  algorithm  will  take,  and  such  a  choice  may  be  crucial  to  the  efficiency  or
feasibility  of  the  computation.  In  expert  systems  the  primary  “data”  are  the  facts,  heuristics,  and  strategies
used  by  experts  to  solve  problems  in  their  domain  of  expertise.  An  appropriate  form  for  representing 
statistical  knowledge  is  a  prerequisite  for  a  successful  expert  data  analysis  system.  We  examine  some 
alternatives  for  knowledge  representation  in  this  context.  Quite  apart  from  its  potential  contribution  to 
expert  systems,  such  investigations  shed  light  on  the  nature  of  data-analytic  expertise  and  how  such
expertise  can  be  taught. 


1.  What  is  an  expert  system?

This  paper  is  an  introduction  to  the  issues  involved  in 
designing  and  implementing  an  expert  system  that  might 
be  useful  in  data  analysis,  with  particular  attention  to 
aspects  of  the  problem  of  representing  statistical 
knowledge  in  a  form  suitable  for  computation.  Expert
systems  differ  in  substantial  respects  from  “ordinary”
statistical  software  systems,  and  the  differences  are
fundamental  to  an  understanding  of  the  role  that  expert
knowledge  plays. 


1.1.  General  definition  and  examples.


Expert  systems  are  defined  partly  in  terms  of  what  they
do,  partly  in  terms  of  how  they  do  it,  and  partly  in
terms  of  the  principles  that  led  to  their  construction.
There  is  some  agreement  (see  Chapter  2  of  Hayes-Roth,
Waterman,  and  Lenat  (1983),  for  instance)  that  an
expert  system  must  perform  a  complex  task  at  the  level
of  a  human  expert  who  has  several  years  of  experience  at
that  task.  Several  attributes  shared  by  expert  systems
have  emerged.  An  expert  system  must  embody  expertise,
in  the  sense  that  it  is  based  upon  rules  which  correspond
to  what  human  experts  do;  it  must  employ  symbolic
reasoning,  rather  than  purely  numerical  computation  in
solving  problems;  it  must  exhibit  intelligence  in  the  sense
that  it  can  reason  from  basic  principles  -  and  can
recognize  which  principles  are  applicable  -  rather  than  being
able  to  deal  only  with  situations  narrowly  specified  in
advance;  it  must  be  dealing  with  a  problem  of  sufficient
complexity  that  human  experts  are  generally  required;  it
should  have  some  ability  to  reformulate  a  problem  from
the  form  originally  presented  into  a  form  more  suitable
for  analysis;  and  finally,  it  must  have  some  ability  to
reason  (or  at  least  to  explain)  about  its  own  reasoning
process.  This  last  attribute  of  having  an  explanation
facility  seems  crucial  and,  to  some  extent,  defining.

Some  examples  of  successful  expert  systems,  which  are
consulted  by  experts  in  practice,  are  DENDRAL
(Buchanan,  Sutherland,  and  Feigenbaum,  1969;  Lindsay,
et  al,  1980)  which  identifies  organic  chemical  compounds
based  on  spectrographic  data;  MYCIN  (Shortliffe,  1976)
which  diagnoses  infectious  blood  diseases;  and
CADUCEUS  (Pople,  1981),  a  system  for  diagnosis  in  internal
medicine.


1.2.  Expert  systems  for  data  analysis


What  role  could  an  expert  system  play  in  the  practice  of 
statistics?  Several  different  “role  models”  have  been
suggested,  and  they  lead  to  very  different  kinds  of
programs,  performing  very  different  kinds  of  tasks.  Oldford
and  Peters  (1984)  developed  a  prototype  expert  system
to  recognize  collinearity  in  regression  problems.  This
program  was  designed  to  be  the  Guardian  of  the  Novice,
in  effect,  to  prevent  the  inexperienced  user  of  regression
from  stumbling  blindly  into  hazardous  terrain.  The 
REX  system  of  Pregibon  and  Gale  (1984),  on  the  other 
hand,  might  be  termed  a  Guide  for  the  Perplexed.  REX 
was  designed  to  guide  its  user  through  an  appropriate 
regression  analysis,  in  effect  taking  on  the  role  of  instruc¬ 
tor  as  well  as  expert.  Both  of  these  systems  assume 
users  with  little  background  in  statistics  or  data 
analysis. 

Another  role  that  expert  systems  could  play  in
statistical  work  is  that  of  an  intelligent  assistant,  with  the
knowledge  required  to  examine  all  of  those  things  which 
the  competent  data  analyst  knows  he  or  she  should  look 
at,  but  for  which  there  is  often  little  time  (or  patience).
On  this  view,  a  program  with  quite  limited  intelligence
could  be  widely  useful;  it  would  not  even  have  to  be 
able  to  deal  with  problems,  it  would  simply  have  to  be 
able  to  recognize  the  problems  and  bring  them  to  the 
attention  of  the  expert  human  statistician.  In  the 
absence  of  a  plea  for  help  from  the  program,  the  statisti¬ 
cian  could  assume  that  no  difficulties  requiring  special 
expertise  were  present,  freeing  him  or  her  to  devote  more 
time  and  energy  to  problems  of  greater  difficulty  or  com¬ 
plexity. 

A  final  role  model  for  expert  systems  in  data  analysis, 
perhaps  the  most  ambitious  of  all,  is  that  of  an
apprentice  consultant.  In  this  view  the  system  would  interact
with  a  practiced,  if  not  expert,  user,  say  a  PhD  student
in  statistics  consulting  with  a  scientist  on  a  problem  in 
data  analysis.  It  would  “look  over  the  shoulder”  of  the 
user,  making  suggestions  and  noting  possible  problems. 
The  goal  here  is  once  again  to  assist  a  user  with  some 
background  in  statistical  analysis  to  make  a  better,  more 
thorough  analysis,  and  to  bring  to  the  fore  situations 
which  may  require  more  expertise  than  either  the
program  or  its  user  possesses.

The  statistical  consulting  program  at  the  University  of 
Chicago  is  not  unlike  that  at  many  universities.  Under 
the  direction  of  two  faculty  members,  all  PhD  students 
must  participate  in  statistical  consulting  with  members 
of  the  university  community,  to  whom  consulting  ser¬ 
vices  are  offered  without  fee.  A  major  problem  is  that 
the  program  directors  are  booked  with  a  solid  three-week 
backlog  of  cases.  Many  of  these  cases  turn  out  to  be  (for 
the  statistician)  routine.  The  possible  role  of  expert  sys¬ 
tems  here  is  to  kill  the  three-week  backlog  by  not  wast¬ 
ing  the  human  expert’s  time  on  routine  matters,  while  at 
the  same  time,  providing  some  assurance  that  major 
difficulties  are  not  simply  being  overlooked. 

In  the  remainder  of  the  paper,  the  intelligent  assistant 
and  the  apprentice  consultant  models  will  be  of  primary 
interest. 


1.3.  Expert  data-analysis  systems  differ  from  standard 
statistical  software. 


A  natural  question  that  arises  is  whether  expertise  could 
be  built  in  to  existing  statistical  packages  such  as 
Minitab,  SAS,  SPSS,  and  the  like.  To  answer  this  ques¬ 
tion  it  is  important  to  understand  how  expert  software 
differs  from  the  standard  software  that  statisticians  are 
used  to  writing  and  interacting  with. 


Statistical  computer  packages  increasingly  offer  on-line 
“help”  facilities,  but  none  of  the  models  of  expert  sys¬ 
tems  outlined  in  the  previous  section  could  adequately  be 
built  upon  these  facilities.  Today,  in  order  to  receive 
help,  the  program  user  must  know  that  help  is  needed 
and  must  know  when  and  how  to  ask.  In  return,  the 
program  generally  can  give  assistance  only  so  far  as  the 
syntax  of  the  program’s  command  language.  Advice 
concerning what the next step to be taken in the
analysis should be, or whether a proposed step is
appropriate,  would  require  not  only  monitoring  the 
sequence  of  commands  entered  by  the  user,  but  also 
some  ability  to  reconstruct  the  reasoning  behind  those 
commands. 

The user-system interaction is also different. Statistical
computer  packages  are  designed  to  give  lots  of  answers 
for  a  few  economically  worded  questions  generated  by 
the  user.  The  expert  systems  discussed  here,  on  the 
other  hand,  are  more  adept  at  raising  questions  rather 
than  answering  them.  In  effect,  their  role  is  to  note 
aspects of the data set that may render all of the
answers  produced  by  a  standard  package  inappropriate, 
misleading,  or  meaningless. 

Finally,  the  internal  construction  of  expert  software  is 
likely to be quite different from that of standard statistical
software in terms of control structure. While flow of
control in the latter is often a matter of sequential invocation
of routines explicitly or implicitly requested by
the user’s typed commands, the flow of control in expert
systems  will  depend  more  upon  the  characteristics  of  the 
particular  data  set  under  consideration.  The  internal 
construction  of  the  expert  system  will  be  suitable  for 
more  symbolic  than  numerical  computation  (although 
today’s  numerical  computations  will  necessarily  be 
invoked  as  subroutines  to  obtain  intermediate  results), 
which  suggests  that  the  code  will  include  substantial 
chunks of LISP or Prolog. The greater the extent to
which  the  data  themselves  determine  the  statistical  com¬ 
putations  to  be  applied,  the  more  one’s  view  of  what 
constitutes  a  statistical  algorithm  becomes  distorted. 
This  leads  us  to  some  consideration  of  the  roles  played 
by  data,  algorithms,  and  knowledge  in  expert  systems. 


2. Algorithms, data structures, and knowledge bases.

The  essence  of  standard  programming  as  we  understand 
it  today  is  neatly  summarized  in  the  title  of  Niklaus 
Wirth’s book, Algorithms + Data Structures = Programs.
It is now well understood that the choice of data structure
can greatly influence the suitability of alternative
algorithms  for  particular  tasks,  and  can  also  greatly 
affect  the  performance  of  algorithms,  and  even  their 
feasibility.  (For  instance,  it  is  rather  difficult  to  carry 
out  a  binary  search  in  a  linearly-linked  list.) 

In expert systems we may have a parallel formula:
“Knowledge + Inference = Expertise,” reflecting the
common-sense notion that experts both know a lot, and
know  when  and  how  to  apply  their  knowledge.  The 
term  “knowledge”  as  used  here  represents  the  collection 
of  facts,  heuristics,  and  strategies  that  experts  use  to 
solve problems. A knowledge base is a structured collection
of symbolically-represented expert knowledge.

The  power  of  an  expert  system  depends  on  its  knowledge 
base.  It  must  have  adequate  coverage,  that  is,  it  must 
contain  facts,  heuristics,  and  strategies  sufficient  to  cover 
the wide range of problems in its domain. It must also
have an adequate representation for that knowledge,
suitable  for  an  appropriate  search  algorithm  to  find 
those  components  of  the  knowledge  base  which  are 
relevant  in  the  current  context.  There  are  several 
schemes  for  knowledge  representation  that  have  been 
developed  in  the  AI  literature,  of  which  a  few  seem  to  be 
particularly  well-suited  to  knowledge  about  data 
analysis. 

The  most  promising  candidates  are  production  systems 
(discussed  by  William  Eddy  and  by  Gail  Gong  in  their 
presentations in this session), augmented transition networks,
and frame systems. Production systems are collections
of rules (“productions”) of the form, “If
condition-A then action-B.” Taken together, the collection
of productions can be thought of as defining a tree
and a way of traversing its branches. In the data-analysis
context, each node in this tree corresponds to a
stage  in  the  data  analysis,  and  moving  from  one  node  to 
another  would  generally  correspond  to  performing  a 
small  piece  of  the  data  analysis.  Augmented  transition 
networks  can  most  easily  be  thought  of  in  this  setting  as 
adding  information  to  the  tree  which  records  the  rela¬ 
tionship  between  any  two  connected  nodes.  Finally, 
frames  are  quite  general  ways  of  organizing  knowledge; 
both  production  systems  and  ATNs  can  be  embedded  in 
the  frame  paradigm.  In  our  setting  we  can  think  of  a 
frame  as  being  a  set  of  productions  which  preserves  the 
context  in  which  the  productions  are  employed. 

The  inferential  machinery,  or  the  method  by  which  the 
knowledge  base  is  searched  to  apply  to  a  situation  at 
hand,  is  related  to  the  adequacy  of  coverage  and  ade¬ 
quacy  of  representation  of  the  knowledge  base  in  much 
the  same  way  that  algorithms  are  related  to  data  struc¬ 
tures  in  conventional  programming.  With  these  ideas  as 
background,  we  now  turn  to  consideration  of  some  issues 
involved in building a suitable knowledge base for data
analysis. 


3. Knowledge engineering.

From the scientific standpoint, knowledge is representational,
in the sense that we cannot say that we know
something  (or  that  we  understand  a  phenomenon)  until 
we  can  represent  it  using  a  model  which  embodies  what 
it  is  that  we  think  we  know.  One  of  the  major  benefits 
of  publishing  scientific  papers  is  that  in  the  act  of  writ¬ 
ing,  authors  are  forced  to  come  to  grips  with  the 
difficulties,  inconsistencies,  gaps,  and  inadequacies  that 
were  simply  not  apparent  to  them  before.  The  theorem 
whose  proof  was  sketched  on  a  napkin  may  prove  to  be 
more  delicate  than  first  thought;  the  iron-clad  argument 
may  reveal  a  chink  in  the  argument.  What  is  more,  the 
concrete representation makes it possible to transmit
this  knowledge  to  others  in  a  way  that  is  more  feasible 
and  more  certain  than  through  observation  and  appren¬ 
ticeship. 

A  concrete  representation  is  not  a  prerequisite  to  having 
knowledge, however. Human experts by definition possess
abilities which others do not, and these abilities are
based on facts and methods which they have assimilated
and  refined  over  time,  whether  they  have  done  so  cons¬ 
ciously  or  not.  Experts  often  cannot  articulate  the 
relevant  knowledge  they  possess  which  they  use  on  a 
daily  basis,  and  what  they  do  say  they  know  may  in  fact 
conflict  with  what  they  actually  use  in  practice.  Many 
experts  are  ill-prepared  by  training  or  by  inclination  to 
articulate  the  knowledge  they  use  in  rendering  expert 
judgments  accurately.  This  makes  it  difficult  to  teach 
new  people  to  become  experts  in  the  field. 

At  this  point  enters  the  knowledge  engineer.  The  term 
has  been  coined  by  AI  workers  in  expert  systems  to 
denote  an  individual  who  is  trained  in  expertise  elicita¬ 
tion  and  articulation,  a  psycho-analyst  of  expertise. 
Knowledge  engineers  typically  are  grounded  both  in 
computer  science  and  cognitive  psychology,  and  what 
they  do  is  to  work  with  a  human  expert  in  his  or  her 
domain  of  knowledge  to  elicit,  and  then  to  fashion  a  con¬ 
crete  representation  of,  the  knowledge  that  the  expert 
brings  to  bear  to  solve  difficult  problems  that  arise  in  the 
expert's  domain.  There  is  a  shortage  of  people  with  the 
qualifications  and  experience  to  do  this  work. 

Note  that  the  knowledge  engineers  themselves  are 
experts  in  a  field,  too  -  that  of  knowledge  elicitation.  To 
distinguish  this  top-level  domain  of  expertise  from  the 
domains  of  experts  to  which  it  is  applied,  following  Gale 
and Pregibon, we refer to the top-level area as the subject
domain, and the areas of application we refer to as
the  ground  domains. 

Statistical  consulting  is  very  similar  to  knowledge 
engineering.  Statistical  consultants  are  expert  in  statisti¬ 
cal  analysis  (the  subject  domain),  and  they  apply  their 
knowledge  by  collaborating  with  experts  in  other  fields 
of  inquiry  (the  ground  domains).  Moreover,  the  first  job 
of  the  statistical  consultant  is  to  help  the  client  to  arti¬ 
culate  what  he  or  she  knows  that  is  relevant  to  the  prob¬ 
lem  (but  may  not  have  realized  consciously).  We  help 
our  colleagues  to  question  assumptions  they  make  impli¬ 
citly.  We  help  them  to  turn  from  matters  of  little  conse¬ 
quence  (“Do  I  use  n  or  n-1  here?”)  and  to  focus  on  those 
matters  that  turn  out  to  be  essential  (“Can  you 
remember  anything  at  all  about  the  experiment  that 
might  distinguish  these  two  halves  of  the  data?”  “Oh, 
yes.  They  were  run  in  different  years  by  different  techni¬ 
cians.”).  We  know  that  the  questions  people  bring  to  us 
are  usually  not  the  appropriate  questions  which  ulti¬ 
mately  get  addressed,  and  we  assist  in  the  process  of  get¬ 
ting  the  right  questions  formulated  so  that  they  can  be 
addressed. 

As  a  consequence  of  these  similarities  to  knowledge 
engineering,  statistics  as  a  discipline  has  something  to 
contribute  to  artificial  intelligence  work  in  general,  and 
to  expert  systems  research  in  particular.  We  have  been 
about  parts  of  the  knowledge  engineering  business  for  at 
least  half  a  century.  (At  the  same  time,  however,  we 
have  devoted  little  attention  to  understanding  very 
thoroughly  how  we  accomplish  what  we  do  in  this  enter¬ 
prise.)  Statistics  can  contribute  some  of  the  basic  ideas 
and methods of data analysis, experience in statistical
consultation, and techniques for elaboration and for
display. It may even be that, despite the shortage of
trained statisticians, we end up contributing
bodies to the knowledge engineering front (since the pay is
better).

Constructing  an  expert  system  which  embodies 
knowledge about data analysis or about statistical consulting
involves much that would be required in an
expert  system  to  construct  expert  systems,  in  that  the 
ground  domain  for  the  system  (statistical  consulting)  is 
itself  a  high-level  domain  of  expertise  which  can  in  turn 
be  applied  in  a  number  of  ground  domains.  The  current 
effort by Gale and Pregibon (1984) to construct Student,
an  expert  system  capable  of  learning  to  do  data  analysis 
in  a  variety  of  contexts  by  working  through  a  sequence 
of  problems  in  those  contexts,  is  in  effect,  an  expert  sys¬ 
tem  for  building  expert  systems.  It  is  an  ambitious 
endeavor,  which  nonetheless  shows  signs  of  great  prom¬ 
ise. 

How should we go about the process of studying what
knowledge  we  bring  to  bear  on  statistical  problems  so 
that  we  can  construct  a  suitable  representation  for  it? 
Pregibon  and  Gale  and  others  have  used  the  device  of 
constructing  worked  examples,  carefully  annotated,  and 
diaries  of  the  analysis  process.  These  devices  can  be  cou¬ 
pled  with  explanation  to  colleagues  who  can  be  expected 
to  ask  penetrating  questions  when  the  reasoning  process 
is  not  entirely  clear,  and  can  be  assisted  by  automatic 
means such as statistical packages which keep “journal
files” of the sequence of commands used in analyzing a
data  set.  Thisted  (1984)  has  described  the  role  that 
computing  software  environments  can  play  in  learning 
about  how  data  analysts  behave  and  what  strategies 
they  adopt.  On  this  view,  a  considerable  amount  can  be 
learned  about  the  process  of  statistical  analysis  without 
actually  attempting  to  implement  any  of  it  in  an  actual 
expert  system  to  be  run  on  a  computer.  A  similar  view 
has  been  expressed  in  the  artificial  intelligence  literature 
by  Doyle  (1984). 


4.  Statistical  consulting  as  a  model. 

A  few  words  are  in  order  concerning  knowledge  about 
statistical  consulting  as  a  basis  for  expert  systems  in 
data  analysis.  The  questions  of  what  facts  consultants 
know,  what  heuristics  they  employ,  and  the  strategies 
that  they  adopt  are  all  understudied  problems.  There 
has  been  a  surge  of  interest  within  the  statistical  com¬ 
munity  in  the  last  five  years  on  the  topic  of  teaching  sta¬ 
tistical  consulting,  and  the  resulting  reflections  on  the 
process  of  statistical  consulting  are  valued  contributions 
to  this  secondary  endeavor  of  building  consulting  exper¬ 
tise  into  usable  computer  systems.  At  the  same  time, 
the  emphasis  has  been  more  on  apprenticeship  and 
supervision  of  trainees  rather  than  on  the  special  skills 
that  expert  data  analysts  have  and  how  those  skills 
might  be  transmitted.  We  know  of  no  study,  thorough 
or  otherwise,  of  the  process  by  which  successful  consul¬ 
tants  in  data  analysis  approach  their  work  and  achieve 
their  results. 


This  said,  we  can  begin  to  outline  the  areas  in  which 
research  is  likely  to  be  fruitful.  Data  analysts  consulting 
with  scientists  expert  in  their  (ground)  domain  are 
general-purpose  scientific  detective/psycho-analysts. 
They  proceed  by  asking  questions,  and  often  these  ques¬ 
tions  are  suggested  by  what  they  see  in  the  data.  The 
answers  to  these  questions  lead  both  to  alternative  ways 
of  looking  at  the  data  and  to  new  questions.  The  impor¬ 
tant  work  of  the  consultant  seems  to  get  done  through 
the questions he asks of the client. It is important, then,
to  investigate  how  these  questions  are  structured,  what 
plans  of  inquiry  are  adopted,  and  what  it  is  that  leads  to 
the  formulation  of  these  plans. 

The  natural  way  to  learn  about  these  issues  is  to  observe 
experts  at  work  (as  knowledge  engineers  do),  perhaps 
even  to  conduct  experiments  involving  them.  Some 
years  ago,  I  received  a  telephone  call  from  a  colleague  in 
pediatric neurology; he had a quick question. “I can’t
remember,” he said, “whether I should use standard
deviation  or  standard  error.  Which  do  you  suggest?” 
We  began  to  talk,  and  over  the  course  of  a  few  weeks,  it 
became  clear  that  the  answer  was,  “None  of  the  above.” 
We  ultimately  used  a  three-factor  unbalanced  mixed 
model with covariates, and we learned more about the
disease  process  under  study  by  doing  so.  Unfortunately, 
I  have  no  idea  what  sequence  of  events  led  from  the 
innocuous  question  on  his  part  to  the  ultimately  more 
complex  solution  at  which  we  arrived.  This  is  the  process 
which  requires  scrutiny  and  study. 

5. Representing knowledge about question-asking.

What  must  be  considered  in  building  a  concrete 
representation  for  the  knowledge  about  question-asking 
that  data  analysts  seem  to  possess  and  use  to  such  good 
effect?  Questions  are  asked  both  of  the  data  and  the 
expert  in  the  ground  domain.  These  questions  often 
alternate,  the  data  suggesting  questions  to  ask  of  the 
client, whose responses suggest new questions to ask of
the  data.  We  can  distinguish  four  levels  of  questioning: 
asking  questions  of  the  data,  asking  questions  of  the 
experts,  using  answers  to  formulate  questions,  and  ask¬ 
ing  questions  about  questioning.  We  now  turn  to  just 
the  first  of  these  levels,  as  it  is  the  level  which  we  are 
currently closest to being able to explicate. Some of the
issues  raised  in  the  remainder  of  this  section  are  dealt 
with  more  thoroughly  in  Thisted  (1985). 

“Asking  questions  of  the  data”  can  be  broken  down  into 
three  rough  stages  which  together  describe  a  single  step 
in  the  analysis  of  a  data  set:  focus,  selection,  and 
transformation/reassessment. In focusing, the analyst
concentrates  on  a  relatively  small  subset  of  the  data, 
perhaps  a  handful  of  variables  (or  cases)  of  interest  at 
the  moment.  Selection  is  the  process  by  which  a  collec¬ 
tion  of  appropriate  transformations  of  the  data  is 
identified;  transformation  here  meaning  nearly  any  com¬ 
putation  on  the  data  set,  including  computing  a  regres¬ 
sion  (producing  estimated  coefficients,  fitted  values,  resi¬ 
duals), computing and displaying a scatterplot of two
variables, constructing a confidence interval, etc.
Finally, transformation and reassessment is the process
of carrying out the computation, and then reassessing
the situation. Reassessment may lead to a change of
focus, to a change in the class of appropriate transformations,
or to new questions of the client.
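
The three stages can be sketched as a loop. This is only an illustration under stated assumptions: the names `run_analysis`, `select`, and `reassess`, the toy data, and the rule "drop variables that show no variation" are all hypothetical, and questions to the client are left out entirely.

```python
def run_analysis(data, focus, select, reassess, max_steps=10):
    """Repeat focus -> selection -> transformation -> reassessment
    until reassessment no longer changes the focus."""
    log = []
    for _ in range(max_steps):
        subset = {name: data[name] for name in focus}          # focus
        transforms = select(subset)                            # selection
        results = {label: fn(subset)                           # transformation
                   for label, fn in transforms.items()}
        log.append((tuple(focus), results))
        new_focus = reassess(focus, results)                   # reassessment
        if new_focus == focus:
            break
        focus = new_focus
    return log

def select(subset):
    # Hypothetical: the only "appropriate transformation" is the range.
    return {"spread": lambda s: {k: max(v) - min(v) for k, v in s.items()}}

def reassess(focus, results):
    # Hypothetical rule: stop looking at variables with no variation.
    return [k for k in focus if results["spread"][k] > 0]

data = {
    "dose": [1, 2, 3, 4],
    "batch": [7, 7, 7, 7],
    "response": [2.1, 3.9, 6.2, 8.1],
}
log = run_analysis(data, ["dose", "batch", "response"], select, reassess)
```

In this toy run the first pass reveals that `batch` is constant, the focus narrows to `dose` and `response`, and the second pass leaves the focus unchanged, so the loop stops. A real system would need a far richer `reassess` step, which is exactly the knowledge that remains to be represented.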

6.  On  carts  and  horses. 

Bill Eddy’s opening remarks were entitled “Artificial
intelligence in statistics: Do we have the cart before the
horse?” This provocative title prompts a few observations
about the AI cart and the statistical horse.

There  is  no  cart.  It  should  be  clear  from  the  outset 
that  expert  systems  for  general  use  in  data  analysis  don’t 
exist,  although  a  few  demonstration  systems  have  been 
built.  Moreover,  there  is  no  general  methodology  for 
building  expert  systems.  And  at  least  for  the  kinds  of 
systems  we  have  been  discussing,  there  are  no  general- 
purpose  expert  systems  of  any  kind  which  incorporate 
the  higher-level  meta-knowledge  of  a  domain  which 
interacts  with  a  variety  of  ground  domains. 

There  is  no  horse.  What  makes  a  particular  data 
analysis a good one is little studied, and even less understood.
At the moment, we teach data analysis and consulting
by example, and we hope that some of it will rub
off  on  our  students. 


We need both carts and horses (in either order).
The combination of the two is certainly more useful than
either separately. What is more, understanding horses
may help us to mass-produce carts, and vice versa. A
better understanding of useful heuristics and successful
strategies (from expert systems) will lead to improvements
in statistical teaching and practice. What is
more, with even rudimentary expert data-analysis systems,
the human experts can be reserved for the important
problems, since there are so many problems and so
few experts.

Neither cart nor horse may be possible. This is a
fact, and we must live with it. But many useful things
have been learned by striving for the impossible. Hence,

We must attempt to build both carts and horses.

There is much to be learned solely from the attempt.
Except perhaps for John Tukey’s personal tour-de-force
(Tukey, 1977) which records what Tukey senses from his
own experience and reflection to be important and useful
strategies and techniques for data analysis, there has
been no serious attempt to represent what data analysts
do, and hence, what knowledge they possess.

We cannot wait until data analysis is more fully understood
to begin work on expert systems for data analysis,
primarily because there is not much work going on trying
to understand what it is that good data analysts do,
and how it can be taught. The major benefit from work
in expert systems for data analysis may well be a better
understanding of the process of data analysis. It is useful
to recall a brief bit of recent history. We wrote programs
before we appreciated the role of data structure,
top-down construction, information hiding, loop invariants,
and the rest. Indeed, much of what we know
about these ideas was learned through reflection on what
made some programs better than others and some programmers
better than others. Even if no generally useful
expert systems are built, we may still make great
strides in improving the general quality of data analysis
because we better understand what goes into data
analysis of high quality, so that we can convey it more
directly and more successfully to budding data analysts.

At the same time, much of expert systems work is
closely related to what we think data analysts actually
do. Both good data analysis and successful knowledge
engineering involve drawing out an expert, evoking what
he knows but does not say about a problem. Both the
statistical consultant and the knowledge engineer must
be adept at asking the right question which brings into
focus the critical aspect of what is being done. Thus,
work in expert systems for data analysis may well bring
new paradigms for knowledge articulation to the attention
of workers in AI, and at the same time may help to
make the techniques of knowledge engineering needed to
construct general expert systems more readily available.


PRODUCTION SYSTEMS AND BELIEF FUNCTIONS

Expert systems are computer programs which use domain-specific knowledge to make inferences about
problems arising in that domain. Some expert systems must handle knowledge which is uncertain, and a
popular tool for handling such uncertain knowledge has been the ad hoc certainty factors found in MYCIN.
We explore another tool, belief functions, introduced by Art Dempster and Glenn Shafer.


1. Production Systems

Suppose a customer wants to buy a VAX computer. He has
some idea of what he wants: his VAX should have so much
disk space; it should support so many telecom lines; he wants it
to connect to this kind of tape drive and that kind of printer;
and so on. However, there are still many details that need to
be decided. What kind of wires should be used to connect this
to that? What kind of boards are necessary? The customer
needs a VAX expert to ensure that the order is consistent and
complete.

Actually, DEC has a computer program that configures VAX’s.
The program, called R1, uses production rules. An example of
a production rule might be:

DISTRIBUTE-MB-DEVICES-3

IF:

THE MOST CURRENT ACTIVE CONTEXT IS
DISTRIBUTING MASSBUS DEVICES

AND THERE IS A SINGLE PORT DISK DRIVE THAT
HAS NOT BEEN ASSIGNED TO A MASSBUS

AND THERE ARE NO UNASSIGNED DUAL PORT
DISK DRIVES

AND THE NUMBER OF DEVICES THAT EACH
MASSBUS SHOULD SUPPORT IS KNOWN

AND THERE IS A MASSBUS THAT HAS BEEN
ASSIGNED AT LEAST ONE DISK DRIVE AND
THAT SHOULD SUPPORT ADDITIONAL DISK
DRIVES

AND THE TYPE OF CABLE NEEDED TO CONNECT
THE DISK DRIVE TO THE PREVIOUS DEVICE
ON THE MASSBUS IS KNOWN

THEN:

ASSIGN THE DISK DRIVE TO THE MASSBUS


A rule contains a left-hand side (LHS) and a right-hand side
(RHS). The LHS is a set of conditions which must be satisfied
before the conclusions or actions in the RHS can be accepted.
To get an idea of how this production program might run,
suppose that each customer order results in a meeting. At the
meeting are representatives of the rules (one representative
for each rule), a secretary, and an arbiter. The secretary
begins by writing the specifications of the customer order on
the blackboard; each representative watches carefully to see if
the LHS of his particular rule has yet been satisfied by the
specifications on the blackboard; when a representative sees
his rule satisfied, he signals the arbiter; more than one rule
can be satisfied at any one instant, so the arbiter must decide
which of the satisfied rules can "fire"; the secretary changes
the specifications on the blackboard according to the RHS of
the fired rule. As more rules are fired, the blackboard changes
and other representatives find their rules satisfied. For each
set of conditions on the blackboard, a representative can have
his rule fired at most once. The meeting continues until no
representative finds his rule satisfied. The blackboard at the
end of the meeting describes the completed specifications of
the customer order.
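
The meeting can be sketched in a few lines. This is a toy under stated assumptions and is not R1: the rule names and facts are invented, conditions are plain strings rather than patterns, the arbiter simply picks the first satisfied rule in list order, and "at most once per set of conditions" is simplified to "each rule fires at most once," which is equivalent here because each rule has a single fixed LHS.

```python
def run_meeting(rules, order):
    """Fire rules until no representative finds his rule satisfied."""
    blackboard = set(order)        # the secretary writes the order down
    fired = []                     # names of rules that have already fired
    while True:
        # Each representative checks whether his LHS is on the blackboard.
        satisfied = [r for r in rules
                     if r["lhs"] <= blackboard and r["name"] not in fired]
        if not satisfied:
            return blackboard, fired
        chosen = satisfied[0]      # the arbiter (assumed: first in list wins)
        blackboard |= chosen["rhs"]   # the secretary applies the RHS
        fired.append(chosen["name"])

# Hypothetical configuration rules, invented for illustration.
rules = [
    {"name": "need-cable", "lhs": {"disk drive", "massbus"},
     "rhs": {"massbus cable"}},
    {"name": "need-massbus", "lhs": {"disk drive"}, "rhs": {"massbus"}},
    {"name": "need-printer-board", "lhs": {"printer"},
     "rhs": {"printer board"}},
]
order = {"disk drive", "printer"}

blackboard, fired = run_meeting(rules, order)
```

Note that `need-cable` cannot fire at first, because no Massbus is on the blackboard; it becomes satisfied only after `need-massbus` has fired. This is the sense in which the blackboard, rather than the order in which rules were written, drives the flow of control.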

In R1, the rules and conditions are assumed to be
deterministic. Either the customer wants a printer or he
doesn't. Given that he wants a printer, he may or may not
need this kind of board, but if we have enough conditions
about what he wants, we can be quite sure of what kind of
board he needs. In, say, a medical diagnosis problem, we are
often not sure if the patient has a particular set of symptoms.
Also, deterministic rules are harder or impossible to obtain.
We cannot say that a person with this list of symptoms is
sure to have this disease. The best we can say is that given
these symptoms, the person is likely to have this disease. The
problem then becomes that of expressing and reasoning with
these uncertainties.

The computer scientists are convinced that using probabilities
is too hard if not impossible, so they have turned to rather
ad hoc procedures, such as the certainty factors found in
MYCIN. Recently, however, some computer scientists have
discovered belief functions, which were first proposed by
Dempster, and later developed by Shafer. (See Shafer (1976).)
Belief functions are appealing to the computer scientists
because they are less restrictive than probabilities, they can
express ignorance, and they have some mathematical
backing. Several artificial intelligence groups are trying to
implement belief functions in their expert systems, and this
is reason enough for statisticians to become actively involved
in belief functions for expert systems.


2.  A  Small  example 

To introduce the ideas of belief functions and their
relationships to production systems, we will consider the
following tiny example. This is not, of course, a real expert
system, but it uses if-then rules to help obtain a desired
conclusion.


Figure 1.

A = "It rained on Monday."
B = "The gardener had muddy boots."
C = "The door was unlocked."
D = "The footprints belong to the gardener."
E = "The fingerprints match those in the gardener's toolshed."
F = "The gardener is guilty."

r1: A -> B
r2: B & C -> D
r3: D -> F
r4: E -> F


Suppose I go away on a trip for a week. During that time, I am
forced to leave my house unoccupied and unguarded. Upon
my return, I discover that the television set is missing. I also
notice that there are dried-up muddy footprints leading to and
from the back door. Who was the thief?

The house is surrounded by clean sidewalks, so an ordinary
passerby would not have had muddy boots unless he had
been walking in the garden and unless the garden was wet.
An idea flashes. Maybe it was the gardener. When I left on
Sunday, the garden was dry. The gardener comes on
Monday. Therefore I construct the rule: If it rained on
Monday, then the gardener had muddy boots.

I don't know my gardener very well, but I have the feeling that
he is not a professional thief. He would not have entered the
house had the door been locked. I construct another rule: If
the gardener had muddy boots, and the door was unlocked,
then the footprints belong to the gardener. Another rule that
obviously follows is: If the footprints belong to the gardener,
then the gardener is guilty.

I might have some other evidence that corroborates the
footprint evidence. Fingerprints are found in the house that
do not match any of the fingerprints of the members of my
family. I construct one more rule: If the fingerprints match
those in the gardener's toolshed, then the gardener is guilty.

Figure  1  summarizes  the  four  rules. 
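
Treating the four rules as if they were certain for a moment, one can ask which basic facts would suffice to conclude F. The sketch below works backward from the goal. The function name `supports` and the dictionary encoding are assumptions made for illustration, and the recursion assumes the rule set has no cycles, as is the case here.

```python
# Figure 1 encoded as rule name -> (conditions, conclusion).
RULES = {
    "r1": (("A",), "B"),
    "r2": (("B", "C"), "D"),
    "r3": (("D",), "F"),
    "r4": (("E",), "F"),
}

def supports(goal, rules):
    """Return the sets of basic facts (facts that no rule concludes)
    that are sufficient to derive `goal`. Assumes acyclic rules."""
    producers = [lhs for lhs, rhs in rules.values() if rhs == goal]
    if not producers:
        return [frozenset([goal])]        # a basic fact supports itself
    result = []
    for lhs in producers:
        # Combine one support set for each condition on this LHS.
        combos = [frozenset()]
        for cond in lhs:
            combos = [c | s for c in combos for s in supports(cond, rules)]
        result.extend(combos)
    return result

lines_of_evidence = supports("F", RULES)
```

The result contains two sets, {A, C} and {E}, which correspond to the footprint argument and the fingerprint argument. Two separate lines of evidence bear on the same conclusion, so any treatment of uncertainty will have to say how they combine.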

We emphasize here that we are allowing for the possibility that
each of these rules need not have 100 percent certainty of
holding. Even though it was raining on Monday, we allow the
possibility that the gardener did not have muddy boots.
Perhaps it was raining so hard that he decided to wait until
Thursday to work on the garden. Also there is uncertainty on
the left-hand-side conditions. The weather reports in the
newspaper are perhaps somewhat reliable, but misprints are
possible; and I'm not sure whether I checked the back door
before I left because it is used so infrequently.

Our goal is to quantify our uncertainties both of the
left-hand-side conditions and of the rules and then be able to
calculate the resulting uncertainties of the right-hand-side
conditions. That is, if we have a measure of belief on A and a
measure of belief on the rule A -> B, then what is our measure
of belief on B?


3.  An  Introduction  to  Belief  Functions 

The material in this section is from Shafer (1976) and Shafer
and Tversky (1984). A frame of discernment $\Theta$ is a set of all
possibilities under consideration. For example, if we were
concentrating just on the question of whether or not A were
true, we might consider the frame

$$\Theta_A = \{a_0, a_1\},$$

where $a_0$ denotes "A is not true", and $a_1$ denotes "A is true".
The frame of discernment can be much more complex, of
course. For example, if we were concentrating on the rule
A -> B, we would consider the frame

$$\Theta_{AB} = \{(a,b) : a = 0,1;\ b = 0,1\}, \tag{1}$$

where a = 0 or 1 according to whether A is false or true, and
similarly for b and B.

Just as it is easier to introduce probabilities through
probability density functions, it is easier, here, to introduce
belief functions through basic probability assignments. Shafer
defines the function $m : 2^{\Theta} \to [0,1]$ to be a basic probability
assignment if

$$m(\emptyset) = 0,$$

$$\sum_{A \subseteq \Theta} m(A) = 1.$$

We call the subsets A of $\Theta$ which are assigned positive m
values, $m(A) > 0$, the focal elements of m. The belief
function $\mathrm{Bel} : 2^{\Theta} \to [0,1]$ associated with m is defined by

$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B).$$

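Before discussing the properties or interpretations of basic
probability assignments and belief functions, let us consider a
simple numerical example:

$$\Theta = \{\theta_1, \theta_2, \theta_3\},$$

$$\begin{aligned}
m\{\theta_1\} &= .32 \\
m\{\theta_1,\theta_2\} &= .08 \\
m\{\theta_1,\theta_3\} &= .48 \\
m(\Theta) &= .12.
\end{aligned}$$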

Notice that unlike probability density functions, the domain of
m is not restricted to singletons; also the focal elements are
not disjoint. We think of our belief as being divisible into
chunks. The m above says we put .32 of our belief on $\theta_1$; .08
of the mass is "free to wander" among the elements in
$\{\theta_1,\theta_2\}$, and so on. A mass which is free to wander on all of $\Theta$
reflects ignorance. The more specific a chunk of mass is, the
more information it reflects. An m function which puts all its
mass on $\Theta$ reflects total ignorance. An m function which puts
all its mass on a singleton reflects total certainty for that
singleton.

The m function describes the measure of belief that we
commit exactly to each set; the total amount of belief
committed to each set is described by the associated belief
function:

$$\begin{aligned}
\mathrm{Bel}\{\theta_1\} &= .32 \\
\mathrm{Bel}\{\theta_1,\theta_2\} &= .40 \\
\mathrm{Bel}\{\theta_1,\theta_3\} &= .80 \\
\mathrm{Bel}(\Theta) &= 1.00
\end{aligned}$$

For example, the belief on $\{\theta_1,\theta_2\}$ is gotten by adding the m
on $\{\theta_1\}$ to the m on $\{\theta_1,\theta_2\}$, these two sets being the subsets
of $\{\theta_1,\theta_2\}$ with positive m values. The mass .40 is the amount
of mass that is confined somewhere in $\{\theta_1,\theta_2\}$, and it
represents our total belief on $\{\theta_1,\theta_2\}$.
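The step from the m function to the belief function can be sketched in a few lines of Python. This is only an illustrative sketch: the frame elements are written as the strings "t1", "t2", "t3", and the function name `bel` is an assumption made here, not notation from Shafer.

```python
from itertools import combinations

# Basic probability assignment from the numerical example above.
# "t1", "t2", "t3" are illustrative stand-ins for the three frame elements.
FRAME = frozenset({"t1", "t2", "t3"})
m = {
    frozenset({"t1"}): 0.32,
    frozenset({"t1", "t2"}): 0.08,
    frozenset({"t1", "t3"}): 0.48,
    FRAME: 0.12,
}

def bel(m, a):
    """Bel(A): total mass of the focal elements that are subsets of A."""
    a = frozenset(a)
    return sum(mass for focal, mass in m.items() if focal <= a)

# Tabulate the belief on every nonempty subset of the frame.
for size in range(1, len(FRAME) + 1):
    for subset in combinations(sorted(FRAME), size):
        print(subset, round(bel(m, subset), 2))
```

Subsets that contain no focal element, such as the singleton for the second element, receive belief 0 even though they are not impossible; that mass is still "free to wander".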

What do numbers such as .32 or .08 mean? In answer to this
question, Shafer and Tversky (1984, p. 23) propose some
thought experiments. The simplest involves simple support
functions whose m functions have the form

$$m(A) = s,$$
$$m(\Theta) = 1 - s,$$

where A is a subset of $\Theta$ and 0 < s < 1. We describe such a
belief function as the simple support function with mass s on
its focal element A. Simple support functions result from a
piece of evidence that offers support for a single subset A. For
example, in the gardener example, if we are concentrating our
attention on whether or not A is true (so that we are looking at
the frame of discernment $\Theta_A$), the newspaper reporting rain on
Monday should give some support on $\{a_1\}$. The amount of
support depends on the following thought experiment.

Imagine a sometimes reliable truth machine. In its "truth"
mode, it tells the truth, but in its "unreliable" mode it
generates totally random statements which give us no added
information. The probability of being in the truth mode is s,
while the probability of being in the unreliable mode is 1 - s.
The truth machine spouts out "A is true". Shafer and Tversky
propose that the resulting belief function should be the simple
support function with mass s on the focal element A. In the
gardener example, we think of the newspaper as the
sometimes reliable truth machine with probability s of telling
the truth, and probability 1 - s of printing a totally random
statement. The newspaper reporting "Rain on Monday" leads
to a simple support function with mass s on the focal element
$\{a_1\}$.

Shafer and Tversky propose other thought experiments for
belief functions which are more complicated than simple
support functions. We will not describe them here.

It may turn out that we have another piece of evidence for rain
on Monday. A neighbor recalls that it rained on Monday. We
would like to combine our evidence supplied by the
newspaper with that supplied by the neighbor. We will use
Dempster's combination rule. Given the basic probability
assignments $m_1$ with focal elements $A_1, \ldots, A_k$ and $m_2$ with
focal elements $B_1, \ldots, B_l$, if K, given by

$$K = \Bigl(1 - \sum_{A_i \cap B_j = \emptyset} m_1(A_i)\, m_2(B_j)\Bigr)^{-1},$$

is positive, then the belief function resulting from the Dempster
combination has m function $m = m_1 \oplus m_2$ defined, for each
nonempty subset A of $\Theta$, by

$$m(A) = K \sum_{A_i \cap B_j = A} m_1(A_i)\, m_2(B_j).$$

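The formula appears more complicated than the concept. To
illustrate, suppose

$$\Theta = \{\theta_1, \theta_2, \theta_3\},$$

$$m_1\{\theta_1,\theta_2\} = .40, \qquad m_1(\Theta) = .60,$$

$$m_2\{\theta_1,\theta_3\} = .80, \qquad m_2(\Theta) = .20.$$

The Dempster combination $m = m_1 \oplus m_2$ is easily gotten by
considering the following table, in which each cell shows the
intersection of the row and column focal elements together
with the product of their masses.

$$\begin{array}{c|cc}
 & m_1\{\theta_1,\theta_2\} = .40 & m_1(\Theta) = .60 \\ \hline
m_2\{\theta_1,\theta_3\} = .80 & \{\theta_1\}:\ .32 & \{\theta_1,\theta_3\}:\ .48 \\
m_2(\Theta) = .20 & \{\theta_1,\theta_2\}:\ .08 & \Theta:\ .12
\end{array}$$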

Since there are no null intersections of focal elements, K = 1.
Also each intersection of a focal element of $m_1$ with a focal
element of $m_2$ leads to a distinct set, so we can just read off m
from the body of the table:

$$\begin{aligned}
m\{\theta_1\} &= .32 \\
m\{\theta_1,\theta_2\} &= .08 \\
m\{\theta_1,\theta_3\} &= .48 \\
m(\Theta) &= .12
\end{aligned}$$

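It is instructive to compare the belief functions of $m_1$, $m_2$, and
m:

$$\begin{aligned}
\mathrm{Bel}_1\{\theta_1,\theta_2\} &= .40 \\
\mathrm{Bel}_1(\Theta) &= 1.00
\end{aligned}$$

$$\begin{aligned}
\mathrm{Bel}_2\{\theta_1,\theta_3\} &= .80 \\
\mathrm{Bel}_2(\Theta) &= 1.00
\end{aligned}$$

$$\begin{aligned}
\mathrm{Bel}\{\theta_1\} &= .32 \\
\mathrm{Bel}\{\theta_1,\theta_2\} &= .40 \\
\mathrm{Bel}\{\theta_1,\theta_3\} &= .80 \\
\mathrm{Bel}(\Theta) &= 1.00
\end{aligned}$$

Since $\mathrm{Bel}_1\{\theta_1,\theta_2\} = \mathrm{Bel}\{\theta_1,\theta_2\}$, the belief on $\{\theta_1,\theta_2\}$ has
remained constant, but in Bel some of the mass that
contributes to total belief on $\{\theta_1,\theta_2\}$ is constrained to lie in
$\{\theta_1\}$.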
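Dempster's combination rule, as stated above, can be sketched directly in Python. This is a minimal illustration rather than an efficient implementation: mass functions are assumed to be dictionaries from `frozenset` focal elements to masses, and the names `dempster` and `bel` and the element labels "t1", "t2", "t3" are choices made for this sketch.

```python
def dempster(m1, m2):
    """Combine two basic probability assignments by Dempster's rule.

    Each assignment is a dict mapping frozenset focal elements to masses.
    """
    raw = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            inter = a & b
            if inter:
                raw[inter] = raw.get(inter, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b  # mass falling on null intersections
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence: K is undefined")
    k = 1.0 / (1.0 - conflict)  # the normalizing constant K
    return {focal: k * mass for focal, mass in raw.items()}

def bel(m, a):
    """Bel(A): total mass of the focal elements contained in A."""
    a = frozenset(a)
    return sum(mass for focal, mass in m.items() if focal <= a)

THETA = frozenset({"t1", "t2", "t3"})
m1 = {frozenset({"t1", "t2"}): 0.40, THETA: 0.60}
m2 = {frozenset({"t1", "t3"}): 0.80, THETA: 0.20}
m = dempster(m1, m2)

for focal, mass in sorted(m.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(focal), round(mass, 2))
```

Run on the example, the sketch gives the four masses read from the body of the table (.32, .08, .48, .12 after rounding), and `bel(m1, {"t1", "t2"})` agrees with `bel(m, {"t1", "t2"})`, which is the comparison made in the text.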

Continuing with the gardener example, suppose that we
believe that the newspaper gives the simple support function

$$m_{\mathrm{news}}\{a_1\} = .6$$
$$m_{\mathrm{news}}(\Theta_A) = .4$$

(the newspaper is not very reliable), and the neighbor gives
simple support function

$$m_{\mathrm{neighbor}}\{a_1\} = .3$$
$$m_{\mathrm{neighbor}}(\Theta_A) = .7$$

(our neighbor is old and often forgets the day of the week).


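$$\begin{array}{c|cc}
 & m_{\mathrm{news}}\{a_1\} = .60 & m_{\mathrm{news}}(\Theta_A) = .40 \\ \hline
m_{\mathrm{neighbor}}\{a_1\} = .30 & \{a_1\}:\ .18 & \{a_1\}:\ .12 \\
m_{\mathrm{neighbor}}(\Theta_A) = .70 & \{a_1\}:\ .42 & \Theta_A:\ .28
\end{array}$$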


As in the example above, K = 1. However the set $\{a_1\}$ is the
result of several distinct intersections of focal elements of
$m_{\mathrm{news}}$ with focal elements of $m_{\mathrm{neighbor}}$, and so getting the
Dempster combination $m = m_{\mathrm{news}} \oplus m_{\mathrm{neighbor}}$ requires a
summation:

$$m\{a_1\} = .18 + .42 + .12 = .72$$

$$m(\Theta_A) = .28$$

The corroborating pieces of evidence have resulted in a fairly
high support for $\{a_1\}$, even though the individual pieces of
evidence each gave only meager support.
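The same numbers can be reached by a shortcut that holds whenever two simple support functions share the same focal element. The symbols $s_1$ and $s_2$ below are introduced here only for illustration; they stand for the two masses (.6 for the newspaper and .3 for the neighbor). Only the intersection of $\Theta_A$ with $\Theta_A$ leaves mass on $\Theta_A$, so

$$\begin{aligned}
m(\Theta_A) &= (1 - s_1)(1 - s_2) = (.4)(.7) = .28, \\
m\{a_1\} &= 1 - (1 - s_1)(1 - s_2) = s_1 + s_2 - s_1 s_2 = .6 + .3 - .18 = .72.
\end{aligned}$$

Read this way, the combined support is the complement of the chance that both sources are in their unreliable mode.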


4.  Belief  functions  for  uncertain  rules 

Of the goals stated at the end of Section 2, we have discussed
methods to attain our first goal, to quantify our uncertainties in
the left-hand-side conditions. We now consider the second
and third goals, to quantify our uncertainties of the rules
themselves and to propagate the uncertainties to the
right-hand-side conditions. Let us focus on the rule $r_1$ : A -> B.
Our relevant frame of discernment $\Theta_{AB}$ is defined in (1). Since $r_1$
is logically equivalent to one of the elements in the set
$\{(0,0),(0,1),(1,1)\}$ being true, it seems reasonable to represent
our belief on the rule $r_1$ by a simple support function with focal
element $\{(0,0),(0,1),(1,1)\}$:

$$m_{r_1}\{(0,0),(0,1),(1,1)\} = p_{r_1}, \tag{2}$$

$$m_{r_1}(\Theta_{AB}) = 1 - p_{r_1}.$$

How should we interpret the mass $p_{r_1}$ that we assign to the
focal element $\{(0,0),(0,1),(1,1)\}$? To answer this, we need to
see how our belief on A propagates through our belief on the
rule to give a belief on B.

In Section 3, when we were considering evidence on A, we
restricted our attention to the frame of discernment $\Theta_A$. Now
considering the rule A -> B, we have a different frame $\Theta_{AB}$.
Actually the two frames are not unrelated: $\Theta_A$ is a coarsening
of $\Theta_{AB}$. The elements in $\Theta_A$ can be put into a one-to-one
correspondence with a partition of the elements in $\Theta_{AB}$. The
correspondence, which we denote by equality, is

$$a_0 = \{(0,0),(0,1)\},$$
$$a_1 = \{(1,0),(1,1)\}.$$

Notice that both sides in the first equation, $a_0$ and $\{(0,0),(0,1)\}$,
represent A being false, and both sides in the second
equation, $a_1$ and $\{(1,0),(1,1)\}$, represent A being true. The
belief function

$$m_A\{a_1\} = p_A, \tag{3}$$

$$m_A(\Theta_A) = 1 - p_A$$

defined on the subsets of $\Theta_A$ can be considered equivalent to
the belief function

$$m_A\{(1,0),(1,1)\} = p_A,$$

$$m_A(\Theta_{AB}) = 1 - p_A$$

defined on the subsets of $\Theta_{AB}$.


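To propagate our belief on A, which is described by $m_A$,
through our belief on the rule, which is described by $m_{r_1}$, we
can simply use the Dempster combination $m = m_A \oplus m_{r_1}$.

$$\begin{array}{c|cc}
 & m_{r_1}\{(0,0),(0,1),(1,1)\} = p_{r_1} & m_{r_1}(\Theta_{AB}) = 1 - p_{r_1} \\ \hline
m_A\{(1,0),(1,1)\} = p_A & \{(1,1)\}:\ p_A p_{r_1} & \{(1,0),(1,1)\}:\ p_A (1 - p_{r_1}) \\
m_A(\Theta_{AB}) = 1 - p_A & \{(0,0),(0,1),(1,1)\}:\ (1 - p_A) p_{r_1} & \Theta_{AB}:\ (1 - p_A)(1 - p_{r_1})
\end{array}$$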

The resulting m function can be easily read from the table.
Letting Bel be the corresponding belief function, we are
interested in

$$\begin{aligned}
\mathrm{Bel}(\text{"B is true"}) &= \mathrm{Bel}\{(0,1),(1,1)\} \\
&= m\{(1,1)\} \\
&= p_A\, p_{r_1}.
\end{aligned} \tag{4}$$


We return now to the interpretation of the number $p_{r_1}$.
Suppose that we are absolutely sure that A is true. This leads
to $m_A$ defined in (3) with $p_A = 1$, and substituting this value
into (4) gives Bel("B is true") = $p_{r_1}$. Therefore, $p_{r_1}$ is our
belief on B if we are absolutely sure that A is true.
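The propagation of belief on A through the rule can be checked numerically on the four-element frame of pairs (a, b). The following is a sketch only: the function names are invented for illustration, the value .72 is the combined support for rain from Section 3, and the rule strength .9 is a hypothetical number, not one given in the paper.

```python
def dempster(m1, m2):
    """Dempster combination of two mass functions (dict: frozenset -> mass)."""
    raw, conflict = {}, 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            inter = a & b
            if inter:
                raw[inter] = raw.get(inter, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {focal: mass / (1.0 - conflict) for focal, mass in raw.items()}

def belief_in_b(p_a, p_r):
    """Bel("B is true") after combining evidence on A with the rule A -> B."""
    theta_ab = frozenset((a, b) for a in (0, 1) for b in (0, 1))
    a_true = frozenset({(1, 0), (1, 1)})          # image of a_1 in the finer frame
    rule = frozenset({(0, 0), (0, 1), (1, 1)})    # pairs consistent with A -> B
    b_true = frozenset({(0, 1), (1, 1)})
    m_a = {a_true: p_a, theta_ab: 1.0 - p_a}
    m_r = {rule: p_r, theta_ab: 1.0 - p_r}
    m = dempster(m_a, m_r)
    return sum(mass for focal, mass in m.items() if focal <= b_true)

# p_a = .72 comes from the newspaper/neighbor example; p_r = .9 is hypothetical.
print(round(belief_in_b(0.72, 0.9), 3))
print(round(belief_in_b(1.0, 0.9), 3))   # certainty on A leaves exactly p_r on B
```

The only focal element of the combination that lies inside the set where B is true is {(1,1)}, which is why the belief on B reduces to the product of the two masses.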


Up to this point, we have been concentrating on the rule
$r_1$ : A -> B. This is the bottom right branch of the tree in Figure
1. Given some evidence on A and some belief on the rule $r_1$,
we have calculated a belief $p_A p_{r_1}$ on B. We can take this belief
on B, combine it with evidence on C and belief on $r_2$ to get a
belief $p_A p_C p_{r_1} p_{r_2}$ on D. In turn, this belief on D together with
belief on $r_3$ gives belief $p_A p_C p_{r_1} p_{r_2} p_{r_3}$ on F. Also, evidence on
E together with belief on the rule $r_4$ gives additional and
independent belief $p_E p_{r_4}$ on F. Combining these two pieces of
support on F gives a total belief

$$p_A p_C p_{r_1} p_{r_2} p_{r_3} + p_E p_{r_4} - (p_A p_C p_{r_1} p_{r_2} p_{r_3})(p_E p_{r_4}).$$


5. Discussion

We have seen a very simple and rather tentative introduction
into production systems and belief functions. The hope here
was to provide a germ from which deeper thoughts may grow
about the problems of dealing with uncertainty in expert
systems. There are many questions that need to be addressed:
Is the belief function $m_{r_1}$ chosen in (2) of the appropriate form for
reflecting beliefs on rules? The combination rule requires that
the two belief functions entering into the combination be
based on independent evidence. How do we handle
dependent pieces of evidence? We have only considered a
very small example. The combination rule potentially involves
intersections and multiplications of all subsets of the frame.
In a large problem, how do we handle the computational
explosion?
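The total belief on F obtained in Section 4 for the gardener example (support through the chain $r_1$, $r_2$, $r_3$ combined with the independent support through $r_4$) is small enough to evaluate by hand, and a short Python sketch makes the arithmetic explicit. The function name and every numeric input below are hypothetical; the paper gives no values for most of these masses.

```python
def total_belief_on_f(p_a, p_c, p_e, p_r1, p_r2, p_r3, p_r4):
    """Total belief on F from the closed-form expression in Section 4."""
    chain = p_a * p_c * p_r1 * p_r2 * p_r3   # support via A -> B, B & C -> D, D -> F
    fingerprints = p_e * p_r4                # independent support via E -> F
    # Two simple support functions on the same focal element combine to
    # s1 + s2 - s1 * s2 under Dempster's rule.
    return chain + fingerprints - chain * fingerprints

# Hypothetical masses, chosen only to exercise the formula.
belief = total_belief_on_f(p_a=0.72, p_c=0.5, p_e=0.6,
                           p_r1=0.9, p_r2=0.8, p_r3=0.7, p_r4=0.9)
print(round(belief, 4))
```

Because the chained support is a product of five numbers no larger than 1, it shrinks quickly; in this illustration most of the final belief on F comes from the fingerprint branch.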


COMPUTER  GRAPHICS:  STATE  OF  THE  ART  FOR  DATA  ANALYSIS 


R.J.  Littlefield 

Battelle Northwest
Richland, WA 99352

The field of computer graphics (CG) is now over 20 years old. In that time,
a rich variety of techniques have been developed for graphical display and interaction.
These techniques have been applied to such diverse areas as computer aided design and
manufacturing,  flight  simulation,  advertising,  big-budget  movies,  video  games,  and  of 
course,  data  analysis.  Compared  to  other  applications,  the  CG  techniques  used  for 
data  analysis  are  usually  quite  primitive.  This  presentation  surveys  the  current 
capabilities  and  limitations  of  CG,  discusses  how  these  affect  its  application  to 
data  analysis,  and  suggests  ways  in  which  more  sophisticated  CG  techniques  could  be 
applied  to  data  analysis.  Particular  emphasis  is  given  to  graphical  interaction  and 
the  role  of  workstations. 


Graphics  for  Specification: 

A  Graphic  Syntax  for  Statistics 


Paul F. Velleman
Cornell University

The  Data  Desk  is  a  full-function  statistics  package  for  the  Macintosh  personal 
computer.  It  employs  a  new  graphic-based  syntax  for  specifying  statistics  operations  and 
data  manipulation.  This  article  describes  the  principles  behind  the  design  of  this  interface  and 
discusses  some  of  the  consequences  of  this  design. 


Computer  graphics  have  traditionally  been  important  parts 
of  statistical  analyses.  (Of  course,  "traditional  graphics"  in 
statistical  computing  means  "used  for  a  decade  or  more  by 
those  who  could  afford  the  hardware.")  Graphics  were 
used  primarily  for  presentation  of  results  and  as  tools  in 
analyzing  data. 

With improving technology came animation and interactive
control of graphics. These were great advances in
principle, but the only contact most data analysts had with
them was watching video tapes and movies enviously at
conferences.

Recently, interactive graphics have begun to come out of
the laboratory. We are seeing more displays in which the
viewer/analyst interacts in real time with the display. For
example, PRIMs of various kinds and origins, brushing
scatterplots, and other ways to perceive higher dimensions
are becoming more widespread.

There  has  also  been  a  growing  interest  in  the  graphical 
control  of  computer  operating  systems.  The  most 
widespread  (and  one  of  the  cheapest  at  today's  prices)  is 
found  on  the  Apple  Macintosh  personal  computer.  The 
ideas behind the Macintosh operating environment are by
no means new, but in the Mac they have been made
accessible and affordable.

Jerry  Lefkowitz  and  I  have  been  engaged  in  a  project  to 
develop  a  statistics  environment  that  uses  graphic  control 
as  the  means  of  communication  between  the  data  analyst 
and  a  statistics  program.  The  program  is  called  The  Data 
Desk,  and  is  currently  running  on  a  Macintosh  computer. 
This article is the initial report on that project.


Graphics: 

For  the  purposes  of  this  discussion,  I  define  graphics  in  a 
very  general  way. 

• Any display whose meaning or function relies to some
important degree on the physical position of things on the
screen (rather than, for example, on the numerical value or
verbal meaning of things on the screen) I will include
under the rubric of graphics. This means that if an
operation is performed by pointing to a word rather than
typing it, I consider it to be a graphic operation. If the
word moves on the page, or is made to appear, or
disappear, or change font or style, I consider that a graphic
operation. One reason for this eclectic definition is that I
can see no reasonable way to draw a line between graphic
symbols that happen to be numerals or letters and other
graphic symbols. The definition is thus an operational one:
if it is used like a graph then it is a graph (even if it looks
like text at a glance).

The  Environment: 

We have implemented this design on a Macintosh
computer. The relevant technical specifications are:

• Graphics hardware: A high-resolution, fast, monochrome
graphics screen (342x512 pixels). A mouse with a single
button.

• Computing hardware: 8MHz MC68000 with 128K (or
512K) RAM and 64K ROM programmed with highly
specialized support functions. Full IEEE floating point
numerics via software emulation. One (or more) 400K
disk.

• Language: All programming in an extended ISO Pascal.
Program resides on a Macintosh XL (née Lisa 2/10) and is
cross-compiled for the Mac. Currently the program is
about 20,000 lines, but it makes extensive use of the
support provided in the Mac ROM for menus, windows,
controls, etc.

• User's environment: The environment is a "Desktop
metaphor". The user sees objects on an imaginary
desktop.  The  objects  can  be  moved,  grouped,  or  discarded 
by  dragging  them  with  the  mouse.  These  objects  open  into 
windows  to  reveal  their  contents.  The  windows  can 
overlap  each  other  and  can  be  repositioned  freely. 

Syntax 

The  basic  syntax  of  a  command  is  object(,  object,  ...) 
verb.  This  syntax  obviates  the  need  for  a  "Do  it"  button 
and  provides  the  opportunity  to  avoid  many  syntax  errors 
by  inactivating  commands  (verbs)  that  would  be 
inappropriate  for  the  arguments  (objects)  selected. 
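The object-then-verb syntax lends itself to a simple rule: once the objects are selected, each verb can decide for itself whether it applies. The Python sketch below illustrates that idea only; it is not The Data Desk's Pascal implementation, and the verb list, the arity tests, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class Verb:
    """A command together with a test on the number of selected variables."""
    name: str
    accepts: Callable[[int], bool]

# Hypothetical verbs; only the two-variable requirement for a scatterplot
# and for a pooled t procedure is taken from the article.
VERBS = [
    Verb("Open", lambda n: n >= 1),
    Verb("Histogram", lambda n: n == 1),
    Verb("Scatterplot", lambda n: n == 2),
    Verb("Pooled t", lambda n: n == 2),
]

def menu_state(selected: List[str]) -> List[Tuple[str, bool]]:
    """Return every verb with a flag saying whether it is currently active."""
    n = len(selected)
    return [(verb.name, verb.accepts(n)) for verb in VERBS]

for name, active in menu_state(["height"]):
    print(f"{name:12s} {'active' if active else 'inactive'}")
```

Because the objects are chosen before the verb, the check runs at the moment the menu is shown, so an inapplicable command can never be issued and no error message is needed for it.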

Principles: 

• Object-oriented. The screen shows graphic objects
(usually as icons) that represent data analytic objects. For
example, each variable has an icon, so a particular variable
is not usually referred to by name, but rather by pointing to
its icon.


issues:  The  major  issues  here  are  in  identifying  the 
appropriate  set  of  objects.  For  example,  one  could 
consider  making  each  case  an  icon  and  graphically 
gathering  samples.  One  could  consider  different  icons  for 
integer,  real,  text,  and  mixed  type  variables  so  that  their 
nature  would  be  immediately  obvious  on  the  screen. 
However,  we  need  to  balance  additional  information 
against  the  chance  of  overwhelming  the  user.  We  have 
settled on a relatively sparse set of objects: variables (of a
few types), collections of variables (of a few types),
output objects (plots, tables, etc.), and a few special
objects.

It is also important to establish consistent behavior among
objects. For example, the same physical action should
have similar consequences for all objects. For instance,
opening an object (on The Data Desk, by
double-clicking on it or using the Open command) always
reveals its internal contents. An opened variable exhibits
its data elements, an opened plot is drawn in its window,
and an opened bundle of variables exhibits the icons of the
variables collected together and their order. Windows
must also behave consistently. A window exhibiting data
is relocated and resized in the same way as one exhibiting a
plot.

• WYSIWYG. What You See Is What You Get. At
any time, the screen shows the current state of the data.
That is not to say that the screen is cluttered with a
spreadsheet of data values. (Rather, the data are arranged
however the user wishes.) But one can immediately
discover the contents of a variable or the state of an
analysis by opening the appropriate icon. Even data editing
is semi-graphical in the sense that the user opens a variable
icon, points to an errant data value, and types the correction.


issues: One of the problems with WYSIWYG operation is
that WYDSIWYDG: What You Don't See Is What You
Don't Get. To operate on an icon, the icon must be visible
or reachable as part of a collection of icons whose icon is
visible. Data cannot be edited out of sight. This is either a
restriction (if you like UNIX-style operations that can
change everything on the disk with one keystroke) or an
advantage (if you want to be protected from unanticipated
consequences of global operations).

•  User-Driven  operation:  The  user  is  in  charge  of  the 
interaction.  Any  operation  is  available  whenever  it  is 
reasonable  (but  see  the  next  item).  Dialogs  in  which  the 
user  is  asked  questions  are  limited  to  specific  details,  and 
have  defaults  that  can  be  accepted  by  pressing  a  single 
button  whenever  possible. 

issues:  We  have  taken  a  specific  stand  against 
"menu-driven"  packages  in  which  the  program  takes 
control  of  the  dialog  and  the  user  supplies  responses  to  a 
long  sequence  of  questions.  Menu  trees  in  our  design  are 
intentionally  short  and  are  actively  pruned  to  cut  away 
branches  that  would  make  no  sense  in  the  current  context. 


•  Error  Avoidance:  The  menus  (being  graphical)  are 
dynamic.  Only  those  operations  that  make  sense  for  the 
arguments  selected  are  available.  For  example,  if  only  one 
variable  has  been  selected,  the  "scatterplot"  command 
cannot be selected. If tests or confidence intervals are
requested, the "pooled t" procedure is not offered unless
two variables have been specified as arguments.

issues:  This  is  a  very  powerful  way  to  avoid  many  errors 
that  would  otherwise  require  error  messages.  It  simplifies 
interactions  with  the  user,  and  it  is  a  valuable  pedagogical 
technique.  Menu  items  that  are  not  active  are  still  visible, 
but  in  a  gray  type.  To  avoid  restricting  sophisticated  users, 
the  design  of  commands,  defaults,  and  dialogs  must  be 
made  with  an  understanding  of  the  statistical  properties  of 
the  procedures  involved. 

•  Customized  Controls:  Controls  are  graphic  images 
on  the  screen  that  serve  to  control  the  environment  or  the 
behavior  of  the  program.  They  are  manipulated  with  the 
mouse.  Thus,  you  can  push  a  button  by  pointing  to  a 
picture  of  a  button  on  the  screen  with  the  mouse  and 
pressing  the  mouse  button.  Because  they  are  graphic 
structures,  controls  can  be  designed  to  suit  a  specific 
purpose.  Why  should  you  press  a  button  labeled  F5  when 
you  could  press  a  button  labeled  "Delete  Data  File"? 
Controls  can  also  be  positioned  intelligently.  For  example, 
buttons  can  appear  directly  under  the  cursor  when  the 
cursor's  position  has  been  otherwise  fixed. 

issues:  The  design  and  positioning  of  controls  is  a 
specialized area worthy of further consideration. At
present, we have copied work done by others (mostly Apple)
for the Mac, but an argument could be made for designing
controls customized for some statistically-based
operations. For example, a control might slide or turn
smoothly to control the turning of a three-dimensional
scatterplot.  This  is  an  area  of  future  research. 

Some  Consequences: 

Some of the consequences of this graphic syntax have
become clear to us only in the course of executing the
design. Others became clear only in the course of teaching 100
undergraduates to use the program and learning from their
experiences. Among the conclusions worth noting:

• There is no need for unique variable names or for
restrictions on characters or length (within reason).
Variables are identified by pointing to them. The screen is
graphically dynamic, so (for example) long variable names
are ordinarily shortened to avoid cluttering the screen. To
see the full name, point to the variable and click the mouse
button. Thus, for example,

Temperature °C

1x2

123

Things I never told my father

are all legal variable names.

• Commands can be verbose (and, consequently, more
statistically precise) because the user is not typing them,
but rather is pointing to them. Thus, for example, the
alternative hypothesis in a test can be stated very explicitly
as, for example: μ1 < μ2.

• Operation speed is greatly improved. (Empirically, we
have observed that even touch typists who are experienced
users of interactive statistics packages can work much
faster on The Data Desk. Certainly students doing
similar assignments are completing them faster on our
program than on the widely used interactive statistics
package we have taught with to date.)

• Learning speed is greatly improved. Computer-naive
undergraduates  were  given  a  single  one-hour  lecture  and 
hands-on  drill.  After  that  they  were  on  their  own  with  very 
little  additional  support  needed.  (Teaching  assistants  were 
available,  but  were  not  asked  computer  questions  very 
often.) 

Note:  These  last  two  points  have  usually  been  thought  to 
be  mutually  exclusive.  Tutorial  programs  that  are  easy  to 
learn  usually  get  in  the  way  of  experienced  users.  Some 
programs  offer  a  "Do  you  want  verbose  prompts?" 
question  early  in  the  session  to  try  to  alleviate  the  problem. 


We have found that this environment is both easy to learn
and easy to use with no changes whatever. It appears that
this stems from the fundamental simplicity of the
interactions on the desktop. One way of viewing this is to
consider the (folk) "Law of Complexity Conservation"
which states that there is a fixed amount of complexity in a
given type of program, but it can be shifted among the
designer, writer, novice, and expert. We have tried to shift
as much of the complexity as possible onto our shoulders
and off of the shoulders of the users.

Problems: 

• It is difficult to write programs (macros) in a language
that lacks a written syntax. One possibility is to "record"
actions to play back later, but that has its own problems.
While  we  have  a  design  completed  for  macros,  this  is  still 
an  area  for  further  research. 

• This style of user interface is computing-intensive. We
find  that  we  are  driving  the  Mac  fairly  hard;  anything  with 
less  power  than  a  68000  might  not  be  able  to  keep  up.  One 
absolute  requirement  is  sharp  graphics.  (We  haven't  felt  a 
need  for  color  yet  at  all.)  The  chief  bottleneck  (as  with 
many  Mac  programs)  is  the  disk  drive. 

Pleasant  Surprises: 

• You can really do quite a lot on a $2000 microcomputer.
The  Mac  is  a  very  powerful  machine,  even  in  its  128K 
size.  The  512K  machine  should  handle  substantial  size 
datasets. 


•  On  a  fully  integrated  system,  many  things  come  for  free. 
For  example,  it  took  no  effort  whatever  to  interface  our 
program  to  most  communications  packages  for  the  Mac  to 
make  up  and  down-loading  of  data  possible.  It  was 
straightforward  to  provide  the  ability  to  paste  output  and 
plots  into  word  processing  documents,  or  to  move  them  to 
graphics  programs  for  further  enhancement. 

•  The  environment  offers  some  unanticipated  pedagogical 
advantages. For example, commands and output can be
sufficiently  verbose  to  be  statistically  precise.  Greek  and 
math  symbols  are  available  to  write  things  in  standard 
notation. 

Whither? 

The  Data  Desk  is  now  a  reasonably  stable  environment 
with a standard collection of statistical capabilities. We
have  been  using  the  program  in  a  second-term  statistics 
class  of  100  computer-naive  sophomores  with  success, 
and  will  make  it  available  for  general  use  by  Fall  term 
1985.  The  next  research  area  is  extensions  to  interactive 
graphics.  Much  of  the  design  of  these  ideas  is  completed, 
but they have not yet been implemented, and are thus a
subject for a future talk.


We would like to report on research about some advanced
methods for exploratory data analysis based on dynamic
computer graphics. These methods are now feasible because
current hardware allows us to recompute and redisplay scatter
plots of up to 1000 data points five to thirty times per second,
thus creating the illusion of continuous motion in a plot. Our
methods are based on the simple idea of moving projection
planes in high (4-10) dimensional data spaces. That is, we
design 1-parameter families of 2-planes in p-space, with the
parameter being thought of as time. We then project
p-dimensional quantitative data onto these planes in rapid
succession while increasing the time parameter in small
steps, which generates movies of data plots that convey a
tremendous wealth of information.
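The idea of a 1-parameter family of 2-planes can be illustrated with a small Python sketch. The construction below is an assumption chosen for simplicity and is not the authors' method: the starting plane (spanned by the first two coordinate axes) is turned by a sequence of rotations in adjacent coordinate planes whose angles grow linearly with time. The function names, the angular speeds, and the sample points are all hypothetical, and no claim is made that this path visits every plane.

```python
import math

def dot(x, y):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def rotate(vec, i, j, angle):
    """Rotate vec by angle in the coordinate plane (i, j); returns a new list."""
    c, s = math.cos(angle), math.sin(angle)
    out = list(vec)
    out[i] = c * vec[i] - s * vec[j]
    out[j] = s * vec[i] + c * vec[j]
    return out

def frame_at(t, p, speeds):
    """Orthonormal pair (u, v) spanning the projection plane at time t.

    speeds holds one angular speed per adjacent axis pair (length p - 1).
    """
    u = [1.0] + [0.0] * (p - 1)
    v = [0.0, 1.0] + [0.0] * (p - 2)
    for i, w in enumerate(speeds):
        # The same rotation is applied to both vectors, so they stay orthonormal.
        u = rotate(u, i, i + 1, w * t)
        v = rotate(v, i, i + 1, w * t)
    return u, v

def project(points, u, v):
    """Coordinates of each p-dimensional point in the plane spanned by u and v."""
    return [(dot(x, u), dot(x, v)) for x in points]

# Hypothetical 4-dimensional data and incommensurate angular speeds.
points = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]
speeds = [1.0, math.sqrt(2.0), math.sqrt(3.0)]

for step in range(3):               # three frames of the "movie"
    t = 0.05 * step
    u, v = frame_at(t, 4, speeds)
    print(step, project(points, u, v))
```

Redrawing the projected points for successive small values of t is what produces the impression of smooth motion; a real display would simply run this loop at the frame rates quoted above.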

We call these dynamic graphics "grand tour" methods. In our
presentation, we will show a short (5 minutes, 16mm) film
featuring two artificial data sets (five circles in 10-space and
a 3-dimensional torus in 6-space), and two well known real
data sets: the Boston Housing data [1] and the Particle
Physics data (see [2], the well known PRIM-9 movie). The
film can be requested from the authors.


It  may  be  true  that  any  single  aspect  of 
structure  in  data  can  be  isolated  and 
somehow  displayed  in  a  number  of  static 
plots,  but  the  grand  tour  offers  a  multitude 
of  aspects  simultaneously  and  in  relation  to 
each other. It can frequently replace hours
of  staring  at  plots  by  a  short  inspection  of  a 
movie  and  dramatically  reduce  the  probabil¬ 
ity  of  missing  structure  as  well.  In  our 
experience,  the  usefulness  of  this  type  of 
display  depends  less  on  the  dimension  of 
data  space  than  on  the  intrinsic  dimension 
of  the  data.  If  the  data  form  0-,  1-  or  2- 
dimensional  manifolds  (i.e.  clusters,  curves, 
or surfaces), the human eye is able to pick
up  the  "gestalt"  almost  instantly  due  to 
motion.  If,  however,  the  intrinsic  structure 
is  of  four  or  higher  dimensions,  grand  tour 
methods alone will not necessarily be successful,
and other tools will have to be used,
perhaps  in  conjunction  with  the  grand  tour. 

We  would  like  to  point  out  an  important 
aspect  of  the  grand  tour  whose  impact  is  not 
apparently  understood  in  a  current  discus¬ 
sion  of  projection  pursuit  contained  in 
P.J. Huber  and  discussants  [3].  Projection 
pursuit  in  its  original  version  is  the  search 
for informative projections through optimization
of information indices as functions of
data  projections.  Thus  the  output  consists 


of  one  or  several  data  plots  corresponding  to 
global  or  local  maxima  of  some  index  chosen 
by  the  data  analyst.  In  contrast,  the  grand 
tour  is  NOT  just  another  vehicle  for  finding 
interesting  static  plots,  and  it  is  not  simply  a 
competitor of projection pursuit. The output
of the grand tour is a MOVIE, with all the
information  encoded  in  the  smooth  motion 
of  the  scatter  plots.  We  argue  that  the 
speed  vectors  of  data  points  in  a  grand  tour 
provide  two  additional  dimensions  of  infor¬ 
mation  in  addition  to  the  two  dimensions  of 
location,  thus  letting  us  perceive  a  full  4- 
dimensional  space  at  any  given  point  in 
time.  In  comparison  to  the  grand  tour, 
three-dimensional  rotation  is  degenerate  in 
that  one  of  the  two  infinitesimal  rotations  is 
held  fixed,  resulting  in  the  loss  of  one  dimen¬ 
sion  of  information. 

Dynamic  features  must  be  carefully  con¬ 
sidered  in  the  design  of  a  grand  tour.  To 
mention  a  few  desiderata: 

-  A  basic  requirement  is  (at  least  piecewise) 
smoothness  of  motion  to  avoid  jitter  in  the 
movie  and  prevent  fatigue  of  the  human  eye. 
The  smoother  the  motion,  the  clearer  will  be 
the  perception  of  the  information  encoded  in 
the  velocities.  Ideal  smoothness  is  achieved 
by  so  called  geodesics,  a  notion  which  is 
applicable  to  our  context  in  the  precise 
sense  of  differential  geometry.  Our  favorite 
implementation  is  actually  based  on  piece¬ 
wise  geodesic  motion. 

-  It  is  important  to  avoid  distraction  due  to 
excessive  within-screen-spin.  By  this  we 
mean  rotation  which  takes  place  within  the 
projection plane rather than in the embedding
space, and which is hence uninformative
if not disturbing. As it turns out, any
given  grand  tour  can  be  modified  such  that 
it  avoids  within-screen-spin  completely, 
although  the  additional  computational 
expense  may  well  slow  it  down  to  an  unbear¬ 
able  extent. 

- Another desideratum is the following: the
2-plane in data space which encodes velocity
should be kept orthogonal to the projection
2-plane to avoid confounding of location and
speed of the dynamic scatter plot points.
This  is  satisfied  by  the  above  mentioned  geo¬ 
desic  motion,  but  one  can  show  that  this 
requirement  confines  the  grand  tour  to  a 
fixed  4-space,  and  hence  must  be  abandoned 
if  the  tour  is  to  scan  5-  and  higher¬ 
dimensional  space.  In  our  implementation 
we  use  only  piecewise  geodesics,  which 
allows  us  to  scan  any  dimension  of  space. 

We  have  developed  a  set  of  tools  for 
designing  and  implementing  grand  tours. 
They  can  be  divided  roughly  into  two  classes: 

1)  Parametrization  of  planes  by  Euler 
angles,  and  design  of  paths  which  scan 
parameter  space. 

2)  Interpolation  between  randomly  selected 
planes  by  "shortest  paths",  and  analo¬ 
gues  of  splines. 

At  this  point  we  need  to  introduce  some  ter¬ 
minology  from  differential  manifolds.  Since 
the  actual  computer  implementation 
requires  a  pair  of  orthogonal  vectors  in  data 
space  for  the  calculation  of  horizontal  and 
vertical  screen  coordinates,  we  need  the 
Stiefel manifolds $S_{2,p}$ of orthonormal 2-frames
in p-space. Similarly, since we would
often  like  to  equivalence  all  data  projections 
which  can  be  transformed  into  each  other 
through  screen  rotations,  we  also  introduce 
the Grassmann manifolds $G_{2,p}$ of 2-planes
in p-space. For implementation purposes,
we  consider  a  grand  tour  as  a  curve  on  a 
Stiefel  manifold,  but  for  theoretical  and  con¬ 
ceptual  considerations  we  prefer  to  look  at  it 
as a curve on a Grassmannian.

The parametrization class of techniques mentioned above parametrizes
either manifold by angles, similar to the way longitude and latitude
parametrize a 2-sphere. Angles are reals mod $2\pi$, i.e. elements of
the circle $T^1 = \mathbb{R} \bmod 2\pi$, and a p-dimensional product
of circles is a torus $T^p$. We use tori as parameter spaces because
they allow natural curves of great smoothness and flexibility, namely
the ones obtained by pushing straight lines from $\mathbb{R}^p$ into
$T^p$. If the coordinates of a vector in $\mathbb{R}^p$ are linearly
independent over the rationals, then the straight line generated by
this vector is dense in $T^p$; hence the resulting grand tour is dense
in the Stiefel or Grassmann manifold if the parametrization is onto.
We have examples of parametrizations of the Stiefel variety as well as
the Grassmannian. For topological reasons they cannot be 1-1. The
techniques for parametrization are borrowed from numerical analysis
and they are based on concatenations of planar rotations (Givens
transformations) and/or reflections on hyperplanes (Householder
transformations). Underlying these constructions is the fact that any
orthogonal mapping can be decomposed into a sequence of Givens and/or
Householder transforms.
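
A minimal sketch of the torus idea follows, assuming a particular ordering of Givens rotations and frequencies chosen as square roots of distinct primes (which are linearly independent over the rationals). The function names and the ordering of the rotation planes are illustrative assumptions; whether this particular parametrization is onto is not claimed here.

```python
import math

def givens_apply(frame, i, j, angle):
    """Rotate every frame vector by `angle` in the (i, j) coordinate plane."""
    c, s = math.cos(angle), math.sin(angle)
    for vec in frame:
        vi, vj = vec[i], vec[j]
        vec[i] = c * vi - s * vj
        vec[j] = s * vi + c * vj

def torus_frame(t, p, freqs):
    """Orthonormal 2-frame in p-space reached at time t.

    The angle vector t * freqs traces a straight line pushed into a torus
    of dimension 2p - 3; each angle drives one Givens rotation.
    """
    # Start from the first two coordinate axes.
    frame = [[1.0 if k == r else 0.0 for k in range(p)] for r in range(2)]
    # 2p - 3 rotation planes: (0,1), (0,2), ..., (0,p-1), (1,2), ..., (1,p-1).
    planes = [(i, j) for i in range(2) for j in range(i + 1, p)]
    for (i, j), w in zip(planes, freqs):
        givens_apply(frame, i, j, w * t)
    return frame
```

Because every step is a rotation applied to both frame vectors, the pair stays orthonormal for all t without any re-orthogonalization.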

The  interpolation  class  of  techniques  for 
grand  tour  construction  is  based  on  succes¬ 
sively  sampling  planes  and  connecting  them 
by  motion  along  suitable  interpolation  paths. 
In  the  tour  version  we  will  show  in  our  movie, 
these  paths  are  geodesics  on  the  Grassman¬ 
nian,  which  are  described  in  an  article  by 
Wong  [4].  They  correspond  to  the  simultane¬ 
ous  interpolation  of  the  principal  angles 
between  two  2-planes.  This  scheme  results 
in  a  tour  which  lacks  smoothness  at  the  end¬ 
points  of  interpolation  paths,  but  geodesics 
enjoy  many  favorable  properties,  some  of 
which  we  mentioned  above  in  our  discussion 
of  dynamic  aspects  of  grand  tours.  Another 
nice  feature  is  the  low  computational  cost 
which  is  not  greater  than  that  of  ordinary 
3d  rotations,  at  least  when  the  tour 
proceeds  on  a  geodesic  path.  At  the  end 
point  of  a  geodesic  segment,  there  is  a  pause 
of  a  fraction  of  a  second  due  to  sampling  a 
new  random  plane  and  setting  up  the  param¬ 
eters  for  the  corresponding  interpolation 
segment. In practice, viewers do not find
these pauses unpleasant; on the contrary,
they perceive ceaseless motion as
overwhelming and tiring.


Currently,  we  are  in  the  process  of  con¬ 
structing  analogues  of  spline  interpolators 
on the Grassmannian. The geodesic tour just
described  can  be  considered  as  a  spline  tour 
of  order  zero.  Splines  of  higher  order  will 
lead  to  perfectly  smooth  motion,  but  will 
lose  some  of  the  simplicity  of  the  geodesic 
tour. 

D. Asimov discusses desirable properties of grand tours in a
forthcoming paper [5]. He states that asymptotically a tour should
form a dense subset of $G_{2,p}$, whereas in terms of finite time it
should spread out quickly on the Grassmannian. This latter requirement
is formalized by the notion of "minimal amount of time needed to get
within an $\epsilon$-neighborhood of any 2-plane." Theoretical lower
bounds can be given by comparing the volume of an
$\epsilon$-neighborhood with the total volume of the Grassmannian. It
is clear that this ratio becomes less favorable for higher dimensional
data spaces. (For volume computations on Grassmannians, see, e.g.,
Santalo [6].) Asimov's paper contains tables and displays which
indicate what can be expected in various dimensions. It is apparently
possible to come within 12 degrees of any plane by watching 1800
randomly sampled planes in 4 dimensions, whereas 20 degrees are
possible with the same number of planes in 6 dimensions. In 8
dimensions one can expect only 39 degrees, and in 10 dimensions 44
degrees. Although these figures appear very discouraging at first, we
should remember, first, that this type of discussion is somewhat
academic, as it neglects the dynamic nature of the grand tour which
lets us perceive four rather than two dimensions at a time. Second,
the dimension of the data space is less of a factor than the intrinsic
dimension of the data in determining how well we can perceive
structure in data (see above). Recognizing the difficulty of finding
structure of low codimension by tour methods, we plan to combine an
interactive and dynamic projection pursuit version with the grand tour
as this will permit the grand tour to remain in neighborhoods of local
and global extrema of information indices on the Grassmannian.

In the previous paragraph, we referred implicitly to metrics on the
Grassmannian when we mentioned $\epsilon$-neighborhoods of 2-planes.
We seem to have an intuitive notion of what is meant by "distance
between two 2-planes", but there are ramifications which we will
explain briefly. The best formalization of our intuitive notion is
probably given by the maximal angle between a vector in one plane and
its projection onto the other plane. A proof is necessary to show that
this leads to a metric on the Grassmannian, and a simple way to go
about it is via an interpretation in terms of the Hausdorff metric on
the unit circles in p-space, which are in 1-1 correspondence with the
2-planes. This metric can also be defined as the larger of the two
principal angles $\theta_1$ and $\theta_2$ between two 2-planes. In
some sense this is an $L_\infty$ metric because it turns out that
$(\theta_1^p + \theta_2^p)^{1/p}$ define metrics on the Grassmannian,
too, which we call $L_p$ metrics for obvious reasons. Wong mentions
the $L_2$ case as the one which creates the Riemannian structure on
the Grassmannian. The other metrics for $1 < p < \infty$ generate
Finsler geometries but these all lead to the same geodesics. Notice
that the $L_\infty$ case does not lead to a Finsler space due to its
non-differentiable nature, but it is obtained as the limiting case of
a 1-parameter family of Finsler geometries.
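
The distances just described can be computed directly from two orthonormal 2-frames. The sketch below assumes each plane is given by an orthonormal pair of vectors and uses the standard fact that the cosines of the principal angles are the singular values of the 2x2 matrix of inner products; the function names are illustrative only.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def principal_angles(F, G):
    """Principal angles (smaller, larger) between the planes spanned by
    the orthonormal 2-frames F and G."""
    m11, m12 = _dot(F[0], G[0]), _dot(F[0], G[1])
    m21, m22 = _dot(F[1], G[0]), _dot(F[1], G[1])
    a = m11 * m11 + m12 * m12 + m21 * m21 + m22 * m22   # s1^2 + s2^2
    d = abs(m11 * m22 - m12 * m21)                      # s1 * s2
    disc = math.sqrt(max(a * a - 4.0 * d * d, 0.0))
    s_big = math.sqrt(max((a + disc) / 2.0, 0.0))       # larger singular value
    s_small = math.sqrt(max((a - disc) / 2.0, 0.0))     # smaller singular value
    # Clamp against rounding before taking the arc cosine.
    return math.acos(min(1.0, s_big)), math.acos(min(1.0, s_small))

def lp_distance(F, G, p=math.inf):
    """L_p combination of the two principal angles; p = inf gives the
    larger principal angle."""
    t1, t2 = principal_angles(F, G)
    if p == math.inf:
        return max(t1, t2)
    return (t1 ** p + t2 ** p) ** (1.0 / p)
```

The result does not depend on which orthonormal basis is chosen within each plane, since rotating a frame within its plane leaves the singular values unchanged.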

In  what  follows  we  present  a  few  ideas 
which  greatly  increase  the  flexibility  of  the 
grand  tour  as  a  viewing  method  for  mul¬ 
tivariate  data.  The  grand  tour  described  so 
far would scan too many projections of modest
interest in many situations. For
example,  in  the  case  of  predictor-response 
data,  one  would  like  to  concentrate  on  plots 
of  linear  combinations  of  responses  versus 
linear  combinations  of  predictor  variables, 
while  in  the  case  of  repeated  measures  data, 
one  would  like  to  concentrate  on  contrasts 


of treatment responses, i.e., linear combinations
whose coefficients sum up to zero. In
the  same  situation,  one  could  also  be 
interested  in  the  dependence  of  contrasts  on 
linear combinations of covariates. We conclude
that, for practical data analysis, one
needs  modified  grand  tours  which  offer  more 
flexibility  in  the  choice  of  data  projections  to 
be  scanned.  For  predictor-response  data 
the  modification  consists  of  confining  a 
grand  tour  to  pairs  of  normalized  vectors 
which  scan  the  unit  sphere  of  predictor 
space  and  response  space  respectively.  The 
manifold  to  be  toured  simplifies  to  a  product 
of  spheres.  This  is  a  submanifold  of  dimen¬ 
sion  p-2  as  compared  to  2p-3,  the  dimension 
of  the  full  Stiefel  manifold.  We  will  show  an 
implementation  of  this  type  of  tour  in  our 
movie. For repeated-measures data, one
would  confine  the  scanning  vectors  to  the 
space  of  contrasts,  i.e.,  the  vectors  which 
are orthogonal to (1, 1, 1, ...).
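
As a small illustration of the contrast restriction, the sketch below maps an arbitrary coefficient vector to a unit-length contrast by removing its component along (1, 1, 1, ...). The function name `to_contrast` is hypothetical; how the authors' implementation enforces the restriction is not described in the text.

```python
import math

def to_contrast(v):
    """Project v onto the space orthogonal to (1, 1, ..., 1) and normalize."""
    mean = sum(v) / len(v)
    c = [x - mean for x in v]          # coefficients now sum to zero
    norm = math.sqrt(sum(x * x for x in c))
    if norm == 0.0:
        raise ValueError("vector is proportional to (1, 1, ..., 1)")
    return [x / norm for x in c]
```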

Grand  tour  techniques  can  also  be 
brought  to  bear  in  contexts  which  are  rather 
different  from  those  we  have  considered  so 
far.  A  basic  data  analytic  operation  is  the 
comparison of several plots of one given data
set.  The  problem  is  to  identify  cases  and 
groups  of  cases  across  two  or  more  plots.  To 
support  this  operation,  one  can  use  geodesic 
interpolation  of  two  projection  planes  to 
transform  one  scatter  plot  into  the  other 
dynamically.  This  makes  use  of  the  fact  that 
our  visual  system  keeps  track  of  the  identity 
of  moving  objects. 

Obviously,  there  are  many  more  possibil¬ 
ities  of  applying  motion  graphics  to  data 
analysis. We hope that the grand tour will be
recognized  as  a  useful  tool  and  a  natural 
extension  of  3d  graphics.  Conceptually, 
higher dimensional motion graphics are at
least as "intuitive" or "counter intuitive" as
3  dimensional  ones,  and  some  important 
capabilities  of  the  visual  system  seem  to 
work  in  higher  dimensions  as  well.  Partial 
supportive  evidence  for  this  claim  will  be 
provided  by  our  film. 


REFERENCES: 

[1] Belsley, D.A., Kuh, E., Welsch, R.E.,
Regression Diagnostics (1980), Wiley.

[2] Fisherkeller, M.A., Friedman, J.H., Tukey,
J.W., "PRIM-9", a movie, 1974, Stanford,
CA.

[3] Huber, P.J., and discussants, Projection
Pursuit Methods, Ann. of Statistics
(1985), to appear.

[4] Wong, Y.-C., Differential Geometry of
Grassmann Manifolds, Proc. Nat. Acad.
Sci., vol. 57 (1967), p.589f.

[5] Asimov, D., The Grand Tour, SIAM Jrnl. on
Sci. and Stat. Computing, 6:1 (1985),


Nonlinear  Least  Squares  and  First-Order  Kinetics 


Douglas M. Bates
Dennis A. Wolf

University of Wisconsin - Madison
Donald G. Watts
Queen's University at Kingston


One of the persistent problems with the use of nonlinear least squares programs is
specifying and coding model functions and partial derivatives then incorporating this code
into the program. We show how these difficulties can be bypassed for the important class of
models defined by linear systems of differential equations. Not only are the model functions
easily specified but the partial derivatives can be automatically generated to allow the use of
sophisticated optimization algorithms without an additional burden on the user. These
models are widely used in pharmacokinetics and chemical kinetics.

An additional problem that occurs in pharmacokinetic analysis is incorporation of
non-homoscedastic error structures. We show how the "transform both sides" approach due
to Carroll and Ruppert can be used with this model specification strategy.


1.  Introduction 

One common difficulty with using nonlinear regression programs
is specifying and coding the model function and, possibly,
its derivatives. Specifying the model function, particularly in the
case of implicit models defined by systems of differential equations,
can provide an opportunity for the user to make syntax and
transcription errors which take a long time to detect and correct. An
even more fertile ground for errors is specifying and coding
derivatives of the model function with respect to the model parameters.
In our experience, this is the single most error-prone stage in a
nonlinear regression analysis. Empirical evidence of this difficulty
is the popularity of derivative-free methods whether based on finite
difference approximations to the derivatives or other schemes such
as DUD (Ralston and Jennrich, 1978).

For one important class of models, the first-order kinetic
models defined by linear systems of differential equations, Jennrich
and Bright (1976) demonstrated that these difficulties can be
avoided. They gave a representation of the solution of the
differential equations in terms of the matrix exponential and showed that
the model derivatives can be computed simultaneously with the
model function. We provide a different derivation with greater
generality in section 2 and discuss some of the implementation
considerations in section 3.

Linear kinetics models are widely used in pharmacokinetics
where they are called "linear compartment models" or, simply,
compartment models. A straightforward application of nonlinear
least squares to pharmacokinetic data is often inappropriate,
though, because the assumption of homoscedasticity (constant
variance) is not warranted. Weighted least squares methods are
sometimes used but we have found the transformation method of Carroll
and Ruppert (1984) to be simple and effective. In section 4 we
describe the method and its implementation, then give some
examples in section 5.


2.  Linear  Kinetics 

A first-order kinetics system, such as a compartment model,
is one described by a set of linear differential equations. In the
compartment models, an organism is considered as composed of
homogeneous, well mixed compartments which communicate with
each other by the exchange of material. A drug administered to the
bloodstream could pass from the blood to body tissues, back into
the blood, and finally be eliminated from the system through the
kidneys, for example. The blood would be considered as one
compartment, other body tissues as a second compartment, while the
exterior of the system would be an implicit, third compartment.
Such a system and its communication paths would be represented
as in Figure 1.


Figure 1: A 2-compartment model


The concentration of the drug in the various compartments at
any time $t$ would be written as the vector $\gamma(t)$.
In a system with $K$ compartments, $\gamma$ would be $K$-dimensional.
The kinetics of the system, which describe how the concentrations
change with time, are linear if we can represent the derivatives of
$\gamma$ with respect to time as a linear function of $\gamma$. That
is, the system is governed by the system of differential equations

$$\frac{d\gamma(t)}{dt} = A\gamma(t) + \iota(t) \qquad (2.1)$$

where $A$ is the $K \times K$ system matrix which does not depend on
$\gamma$ or $t$ and $\iota(t)$ is the driving function for the system
which indicates how material is being added to the system.

In pharmacokinetics, the driving function is usually a bolus
injection into a compartment, corresponding to an impulse or
$\delta$-function in that compartment, or an intravenous infusion into a
compartment, corresponding to a constant input function in that
compartment from time $t_0$ to $t_1$. With a bolus injection, we
usually consider the injection as determining initial conditions

$$\gamma_0 = \gamma(0)$$

but we will find it convenient to consider general driving functions
in this section.

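These systems are often described in terms of rate constants
denoted $k_{ij}$ which give the multiplier for the communication from
compartment $j$ to compartment $i$ as shown in Figure 1. (By
convention, a rate constant $k_{0i}$ is the rate constant for elimination
from compartment $i$). The system in Figure 1 would correspond to
the linear differential equations

$$\frac{d\gamma_1(t)}{dt} = -(k_{01} + k_{21})\gamma_1(t) + k_{12}\gamma_2(t)$$

$$\frac{d\gamma_2(t)}{dt} = k_{21}\gamma_1(t) - k_{12}\gamma_2(t)$$

If $\theta_1 = k_{01}$, $\theta_2 = k_{21}$, and $\theta_3 = k_{12}$, the system matrix is
then

$$A = \begin{pmatrix} -(\theta_1 + \theta_2) & \theta_3 \\ \theta_2 & -\theta_3 \end{pmatrix}$$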

The solution to the system (2.1) with driving function $\iota(t)$ is

$$\gamma(t) = e^{At} * \iota(t) \qquad (2.2)$$

where $e^{At}$ is the matrix determined by the convergent power series

$$e^{At} = I + \frac{At}{1!} + \frac{(At)^2}{2!} + \cdots$$

and the $*$ denotes convolution. That is,

$$e^{At} * \iota(t) = \int_0^t e^{A(t-u)}\,\iota(u)\,du$$

In the case of a bolus injection where $\iota(t)$ is an impulse function,
the solution (2.2) collapses to

$$\gamma(t) = e^{At}\gamma_0 \qquad (2.3)$$
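
To make the power series and the bolus solution concrete, here is a minimal sketch in plain Python. Summing the series directly is only adequate when the entries of $At$ are small, and it is used here purely for illustration; the function names and the parameter values in the usage note are assumptions.

```python
def mat_mul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def expm_series(A, t, terms=30):
    """Partial sum I + At + (At)^2/2! + ... of the matrix exponential."""
    n = len(A)
    At = [[a * t for a in row] for row in A]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # current term (At)^k / k!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, At)]
        result = [[r + x for r, x in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    return result

def bolus_solution(A, gamma0, t):
    """gamma(t) = exp(At) gamma0, the impulse-input solution."""
    E = expm_series(A, t)
    n = len(gamma0)
    return [sum(E[i][j] * gamma0[j] for j in range(n)) for i in range(n)]
```

For the two-compartment system with, say, rate constants 0.1, 0.2, 0.3 and initial concentration 5 in compartment 1, one would call `bolus_solution([[-0.3, 0.3], [0.2, -0.3]], [5.0, 0.0], t)` at each measurement time.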

Using (2.2) or the special form (2.3), we can determine the
state of the system at any time $t$ and hence determine the
$N$-dimensional expected response vector $\eta$ for a nonlinear regression
model where the response being considered is the concentration in
one compartment and the experimental conditions are the times
$t_n$, $n = 1, \ldots, N$ at which this concentration is measured.
However, we can also use the same technique to determine the derivatives

$$\frac{\partial\gamma(t)}{\partial\theta_p}, \qquad p = 1, \ldots, P$$

where $P$ is the total number of parameters. To avoid cumbersome
expressions, we will adopt the convention that a subscript $p$ denotes
differentiation with respect to $\theta_p$. We obtain the derivatives by
differentiating the system (2.1) to obtain

$$\frac{d\gamma_p(t)}{dt} = A\gamma_p(t) + A_p\gamma(t) + \iota_p(t) \qquad (2.4)$$

which is simply another linear system of differential equations with
system matrix $A$ and driving function $A_p\gamma(t) + \iota_p(t)$. The
solution is thus

$$\gamma_p(t) = e^{At} * [A_p\gamma(t) + \iota_p(t)] \qquad (2.5)$$

where $\gamma(t)$ can be obtained from (2.2).
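
The step from (2.1) to (2.4) can be spelled out as follows, assuming that differentiation with respect to $t$ and $\theta_p$ can be interchanged and applying the product rule to $A\gamma(t)$:

```latex
\frac{\partial}{\partial\theta_p}\,\frac{d\gamma(t)}{dt}
  = \frac{\partial}{\partial\theta_p}\bigl[A\gamma(t) + \iota(t)\bigr]
  = \frac{\partial A}{\partial\theta_p}\,\gamma(t)
    + A\,\frac{\partial\gamma(t)}{\partial\theta_p}
    + \frac{\partial\iota(t)}{\partial\theta_p}
\quad\Longrightarrow\quad
\frac{d\gamma_p(t)}{dt} = A\,\gamma_p(t) + A_p\,\gamma(t) + \iota_p(t).
```

Since this has the same form as (2.1) with $\iota(t)$ replaced by $A_p\gamma(t) + \iota_p(t)$, the solution formula (2.2) applies unchanged.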

Returning to the system of Figure 1, suppose that the input
function was a bolus injection of known amount into compartment
1, the blood. Since the "volume of distribution" for the blood
would generally be unknown, the initial concentrations would be
represented as

$$\gamma_0 = (\theta_4, 0)^T \qquad (2.6)$$

and the solution would be given by (2.3). Equation (2.5) collapses
to

$$\gamma_p(t) = e^{At} * A_p e^{At}\gamma_0 + e^{At}\gamma_{0p}$$

This may still seem complicated but the pieces are rather simple.
Here

$$\gamma_{0p} = 0, \quad p = 1, 2, 3; \qquad \gamma_{04} = (1, 0)^T$$

$$\gamma_p(t) = e^{At} * A_p e^{At}\gamma_0, \quad p = 1, 2, 3$$

$$\gamma_4(t) = e^{At}\gamma_{04}$$

There is another way in which parameters can enter the
kinetic system and that is as a "dead time" or lag time. The
measured time, $t$, may not correspond to the effective time in the system
and it may be more realistic to describe the kinetics in terms of

$$\tau = t - t_0$$

where $t_0$ is an unknown parameter. This modification is easily
incorporated into (2.2) and (2.4) to generate the required expected
responses and derivatives.


3. Implementation

The implementation of these methods involves two considerations:
specifying the model, and performing the calculations in
(2.2) and (2.4).

The model can be specified by indicating the roles of the
parameters as rate constants, initial conditions, dead times, etc.
through a parameter-use matrix. We have chosen to use a matrix
with 3 columns, the first containing the parameter number. If the
parameter is a rate constant the second and third columns indicate
the source and sink compartments with a sink of 0 indicating
elimination. Initial conditions or other forms of driving functions are
specified with negative values in the third column and the number
of the affected compartment in the second column. A -1 in the
third column indicates the level of an impulse, -2 indicates the level
of a constant infusion, -3 indicates the slope of a linear infusion,
etc. This specification scheme, combined with the superposition
property of linear kinetic systems, can be used to model a driving
function using splines. To indicate a lag time, we use a zero in the
second column.

As an example, the parameter-use matrix for the system
described in Figure 1 with the initial conditions (2.6) is

    1   1   0
    2   1   2
    3   2   1
    4   1  -1

Using this information and the current parameter values, a
program can generate $A$ and $\iota(t)$.
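
A minimal sketch of how a program might turn a parameter-use matrix into $A$ and the initial conditions is given below. It handles only rate constants and impulse levels (third column -1); infusions and lag times are left unimplemented, and the function name `build_system` is an assumption rather than the authors' code.

```python
def build_system(use_matrix, theta, n_comp):
    """Build the system matrix A and initial vector gamma0.

    use_matrix rows are (parameter number, second column, third column)
    as described in the text; parameter numbers are 1-based.
    """
    A = [[0.0] * n_comp for _ in range(n_comp)]
    gamma0 = [0.0] * n_comp
    for param, second, third in use_matrix:
        value = theta[param - 1]
        if second == 0:
            # Lag time: not handled in this sketch.
            raise NotImplementedError("lag times are not handled here")
        if third >= 0:
            # Rate constant: material leaves `second`, enters `third`
            # (a sink of 0 means elimination from the system).
            src = second - 1
            A[src][src] -= value
            if third > 0:
                A[third - 1][src] += value
        elif third == -1:
            # Level of an impulse (bolus) into compartment `second`.
            gamma0[second - 1] += value
        else:
            raise NotImplementedError("only impulse inputs are handled here")
    return A, gamma0
```

Because the same parameter number may appear in several rows, a single parameter can contribute to several entries of $A$.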

Notice that this scheme allows a single parameter to have
multiple uses. Changing the parameter-use matrix to

    1   1   0
    2   1   2
    2   2   1
    3   1  -1

and re-fitting the model will allow testing of the hypothesis that
$k_{12} = k_{21}$.

Once $A$ and $\iota(t)$ have been determined, the expressions in
(2.2) and (2.5) must be evaluated. Moler and Van Loan (1978)
give an extensive survey of methods for the matrix exponential and
conclude that methods based on an eigenvalue-eigenvector
decomposition of $A$ should be used when evaluations for many different
$t$'s are required. If the eigenvalues of $A$ are real and there is a
complete set of eigenvectors so we can write

$$A = U \Lambda U^{-1} \qquad (3.1)$$

with

$$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_K) \qquad (3.2)$$

then

$$e^{At} = U e^{\Lambda t} U^{-1} \qquad (3.3)$$

where

$$e^{\Lambda t} = \mathrm{diag}(e^{\lambda_1 t}, \ldots, e^{\lambda_K t})$$

which immediately gives an evaluation for impulse driving
functions through (2.3).

One difficulty here is that the decomposition in (3.1) does not
always exist, even for non-pathological cases, and the detection of
those cases is quite difficult. Standard eigenvalue-eigenvector
routines such as those in Eispack (Smith et al., 1976) will usually
return a decomposition even in degenerate cases and the only clue
that the decomposition doesn't exist is that $U$ has a huge condition
number. Bavely and Stewart (1979) provide a method to reduce $A$
to a block-diagonal form which can be used to evaluate the matrix
exponential in these cases. The method can be implemented in a
straightforward fashion but is too lengthy to describe here.

Assuming then that software such as Eispack code can
produce the decomposition (3.1) with a well-conditioned $U$, it is
convenient to pre-multiply all the system vectors by $U^{-1}$ to produce

$$\xi(t) = U^{-1}\gamma(t)$$

$$\xi_0 = U^{-1}\gamma_0$$

$$\kappa(t) = U^{-1}\iota(t)$$

$$\xi_p(t) = U^{-1}\gamma_p(t)$$

$$\kappa_p(t) = U^{-1}\iota_p(t)$$

and, finally,

$$C_p = U^{-1} A_p U$$

The notation for $\xi_p(t)$ and $\kappa_p(t)$ is not consistent with earlier usage
since, for example, $\xi_p(t)$ is not the derivative with respect to $\theta_p$ of
$\xi(t)$. It is convenient though.

Expressions (2.2) and (2.5) now become

$$\xi(t) = e^{\Lambda t} * \kappa(t) \qquad (3.6)$$

and

$$\xi_p(t) = e^{\Lambda t} * (C_p\,\xi(t) + \kappa_p(t)) \qquad (3.7)$$

Because $e^{\Lambda t}$ is diagonal, the convolutions can be evaluated as
scalar convolutions. For example, with an impulse driving
function, (3.6) and (3.7) reduce to

$$\xi(t) = e^{\Lambda t}\xi_0 \qquad (3.8)$$

and

$$\xi_p(t) = (e^{\Lambda t} * C_p e^{\Lambda t})\,\xi_0 + e^{\Lambda t}\xi_{0p} \qquad (3.9)$$

Each element in the convolution matrix is

$$[e^{\Lambda t} * C_p e^{\Lambda t}]_{ij} = (C_p)_{ij}\,(e^{\lambda_i t} * e^{\lambda_j t}) \qquad (3.10)$$

where

$$e^{\lambda_i t} * e^{\lambda_j t} =
\begin{cases}
\dfrac{e^{\lambda_i t} - e^{\lambda_j t}}{\lambda_i - \lambda_j} & \lambda_i \neq \lambda_j \\[1ex]
t\,e^{\lambda_i t} & \lambda_i = \lambda_j
\end{cases} \qquad (3.11)$$

In practice, the condition $\lambda_i = \lambda_j$ is determined by comparing
$|(\lambda_i - \lambda_j)t|$ to the relative machine precision.

Since this implementation uses the rate constants directly and
the constants must be non-negative, the actual parameters that we
use are the logarithms of the rate constants and of the unknown
initial concentrations. This avoids having to use constrained
optimization methods for physically meaningful parameter estimation. It
does produce a minor difficulty when a particular path is not
needed for the model since the estimate of the log rate constant
tends to negative infinity. This situation is easily detected by the
user and the model re-specified.
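
The scalar convolution of two exponentials, together with the test for nearly equal eigenvalues, can be sketched as follows. The default tolerance (the square root of the machine epsilon) is an assumption chosen here to limit cancellation error in the difference quotient; the text says only that the quantity is compared to the relative machine precision.

```python
import math
import sys

def exp_convolution(lam_i, lam_j, t, tol=math.sqrt(sys.float_info.epsilon)):
    """Convolution of exp(lam_i * t) and exp(lam_j * t) on [0, t].

    Uses the limiting form t * exp(lam_i * t) when the two eigenvalues
    are numerically indistinguishable at time t.
    """
    if abs((lam_i - lam_j) * t) <= tol:
        return t * math.exp(lam_i * t)
    return (math.exp(lam_i * t) - math.exp(lam_j * t)) / (lam_i - lam_j)
```

Multiplying this scalar by the corresponding element of the transformed derivative matrix gives one element of the convolution matrix.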


4. Heteroscedasticity

With measurements of physical quantities, such as drug
concentrations, it is not uncommon to have the level of the noise
increase with the level of the signal so nonlinear regression
modelling with a constant variance assumption is inappropriate. A
realistic fitting of compartment models should include some method of
allowing for changing variances in the noise. Some weighted
least-squares methods have been used (Jennrich and Bright, 1976;
Kramer et al., 1974; Wagner and coworkers, 1977) but the
weights are often chosen on an ad-hoc basis and, more
importantly, the weights are often based on the observed concentrations
rather than the predicted concentrations.

Several related transformation methods, which model the
changing variance as a function of the response level and thus
account for heteroscedasticity, have been proposed (Box and Cox,
1964; Carroll and Ruppert, 1984; Pritchard, Downie, and Bacon,
1977). We find the Carroll and Ruppert approach to be reasonable
and easy to implement. This uses the Box-Cox transformation family

$$y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\[1ex]
\log(y) & \lambda = 0
\end{cases} \qquad (4.1)$$

in what Carroll and Ruppert call "transforming both sides".

For a given value of $\lambda$, the estimates $\hat\theta_\lambda$ are determined by
fitting the transformed data to the transformed model function
resulting in a loglikelihood, up to a constant, of

$$l(\lambda) = \lambda \sum_{n=1}^{N} \log(y_n) - \frac{N}{2}\log(S(\hat\theta_\lambda)) \qquad (4.2)$$

which is then optimized over $\lambda$. Since the derivatives of $\eta^{(\lambda)}$
with respect to $\theta$ are easily calculated from the derivatives of $\eta$, we can use the
methods of the previous section to calculate models and derivatives
for transformed compartment models.
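
The transformation and the profile loglikelihood can be sketched as below. The sketch assumes the fitted values at the least-squares estimate for the given $\lambda$ have already been computed by some nonlinear regression routine and are passed in as `eta_hat`; the function names are illustrative.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transformation of a positive value y."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def tbs_loglik(y, eta_hat, lam):
    """Loglikelihood (up to a constant) for 'transform both sides'.

    y       : observed positive responses
    eta_hat : fitted positive responses at the estimate for this lam
              (assumed to be supplied by a separate fitting step)
    """
    n = len(y)
    # Residual sum of squares on the transformed scale.
    s = sum((box_cox(a, lam) - box_cox(b, lam)) ** 2
            for a, b in zip(y, eta_hat))
    return lam * sum(math.log(a) for a in y) - 0.5 * n * math.log(s)
```

Evaluating `tbs_loglik` over a grid of values of `lam`, re-fitting the model at each one, gives the kind of loglikelihood curve discussed in the text.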

The loglikelihood function over a range of $\lambda$ can give an
indication of what are "reasonable" values for $\lambda$. In some cases, as
shown in the following section, there is very little sensitivity of the
data to transformation and $\lambda$ is essentially irrelevant. In other
cases, the value of $\lambda$ is sharply determined and the need for
transformation clearly defined. We examine the plot of the
loglikelihood versus $\lambda$ to determine a reasonable and "natural"
value of $\lambda$ (usually 0, 1/2, or 1) and, using the rationale of Box
and Cox (1982) or Hinkley and Runger (1984), condition the
subsequent analysis on that value of $\lambda$.

5. Examples

We consider three examples from the literature to demonstrate
the application of the transformation approach for homoscedasticity
and the flexibility of model description. The Brunhilda
data from Jennrich and Bright (1976), shown in Table 1, are blood
concentrations of sulphate measured by a radioactive assay. The
results are quoted as counts. Jennrich and Bright fit a
three-compartment catenary model (Figure 2) to these data using
weighted least squares with the weights proportional to y_i^-2 and
assuming an initial concentration corresponding to a count of
2×10^5. We fit the same model but with a sixth parameter of the
initial count in compartment one and using the power transformations.


time (min.)    activity (counts)

      2            151117
      4            113601
      6             97652
      8             90935
     10             84820
     15             76891
     20             73342
     25             70593
     30             67041
     40             64313
     50             61554
     60             59940
     70             57698
     80             56440
     90             53915
    110             50936
    130             467)7
    150             45996
    160             44968
    170             43602
    180             42668

Table 1: Data from Jennrich and Bright (1976)


Figure 2: A 3-compartment catenary model


The loglikelihood of λ, along with the data in the original
count scale, is shown in Figure 3. For λ, the MLE was about -0.1
with wide 95% confidence limits of -2 to 1.75, so we selected λ = 0
(log transformation). The fitted parameters, confidence limits and
parameter use matrix are shown in Table 2. In addition, the
parameter estimates for λ = 1 are included for comparison.

The parameter estimates are very insensitive to transformation,
primarily because the relative range of the responses is not
large. The ratio of the largest to the smallest observation is 3.54:1
and even the logarithm transformation is fairly linear over this
range, as shown in Figure 4b.

We also show the observed and predicted responses and some
of the residual analysis in Figure 4. The residuals for this model
do not show suspicious patterns, but fitting these data with a
two-compartment model did produce noticeable patterns in the
residuals. The need for a three-compartment model to adequately
represent these data was confirmed with an F-test.

Par. Use    Est. (0)    95% conf. int.     Est. (1)
            0.00941     0.0083, 0.0104     0.00972
            0.2848

Table 2: Parameter estimates for Brunhilda data

One point of interest about the fitted parameters is that the
initial activity assumed by Jennrich and Bright (1976), 2×10^5, is
not included in the confidence limits for θ_6. If the model is fitted
on the log scale with an initial activity of 2×10^5, the residual
sum-of-squares is 0.00287 with 16 degrees of freedom. Including
θ_6 in the model produces a residual sum-of-squares of 0.000878, so
the calculated F statistic for a test of θ_6 = 2×10^5 is 34.06 with 1
and 15 degrees of freedom. Besides the formal F-test demonstrating
that 2×10^5 is a poor value of θ_6, we also found that the
residuals for that fit exhibited poor behavior.
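The F statistic quoted above is the usual extra-sum-of-squares comparison of a restricted fit with a fuller one. A minimal Python sketch of that arithmetic follows, using the rounded sums of squares printed in the text; the function name `extra_ss_f` is illustrative. Because the inputs are rounded, the result is close to, but not exactly, the reported 34.06.

```python
def extra_ss_f(rss_restricted, df_restricted, rss_full, df_full):
    """F statistic comparing a restricted fit with a fuller fit."""
    numerator = (rss_restricted - rss_full) / (df_restricted - df_full)
    denominator = rss_full / df_full
    return numerator / denominator


# Fixed initial activity (16 df) versus estimated initial activity (15 df).
f_stat = extra_ss_f(0.00287, 16, 0.000878, 15)
print(round(f_stat, 1))  # roughly 34.0 from the rounded inputs
```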

As a second example, we consider the digoxin data from Kramer
et al. (1974). A rapid (bolus) intravenous injection of 1 mg
of this drug was administered to five healthy male volunteers and
blood samples were periodically withdrawn and assayed using a


'•'l  radioimmunoassay.  The  data  from  person  DL,  consisting  of 
serum digoxin concentrations, is shown in Table 3. Kramer et al.
(1974) fit the data to the three-compartment mammillary model of
Figure 5 using weighted least squares with weights proportional to
exp(-0.294 y_i). These weights were obtained from a separate
experiment.

We again found that λ = 0 appeared to be a suitable choice,
but this time the plot of the loglikelihood versus λ (Figure 6b)
indicates a fairly short range of acceptable λ values. The MLE is at
about 0.1 with approximate 95% confidence limits of -0.1 to 0.35.
The fitted parameters, confidence limits and parameter use matrix
are shown in Table 4 along with the parameters estimated with
λ = 1. In this example the difference between the parameters
estimated using an unweighted analysis and those obtained from an
analysis of the logs is striking, but here the ratio of the maximum
observation to the minimum observation is greater than fifty, so the
log transformation is quite nonlinear, as shown in Figure 7b. The
fitted values reported in the original paper differ only slightly from
those here. The residuals, displayed in Figures 7c and 7d, do not
demonstrate disturbing patterns.

Table 3: Data from Kramer et al. (1974)

Figure 5: A 3-compartment mammillary model

Table 4: Parameter estimates for Person DL
Kramer et al. (1974)

Table 5: Data from Kaplan et al. (1972)

Table 6: Parameter estimates for Subject 5
Kaplan et al.

Figure 7d


Both these examples indicate the need for a three-compartment
model. In practice, the use of two-compartment
models is much more common, such as the example from Kaplan et
al. (1972), who studied the pharmacokinetic profile of sulfisoxazole
in man after a bolus 2 g intravenous injection. The data from
Table 5 were fit to a two-compartment model with the results
shown in Table 6.

The loglikelihood curve, plotted in Figure 8, achieves a maximum
at about 0.7 with approximate confidence limits of 0..^5
to 0.95, so we chose a convenient λ of 0.5. The estimates obtained
from the untransformed data fit fall within the confidence limits
obtained using the optimal λ. The transformation is quite linear
over most of the range of concentrations, but as the severity of the
transformation increases, i.e. λ decreases, the last observation
becomes more important in determining the fit. Again, the residual
analysis in Figure 9 does not reveal suspicious patterns.

6. Discussion

The calculation of derivatives in section 2 can easily be generalized.
For example, second derivatives represent solutions to systems of
the form

which, again, have a solution through convolution as

31,,r  = 

since,  in  our  representation. 


and 


As mentioned in section 2, the method of Bavely and Stewart
(1979) allows generalization to systems where there are degenerate
eigenspaces so U^-1 does not exist. These methods can also extend to
the case of complex eigenvalues which, though rare, can occur in
practice.

In some chemical modelling situations, the rate constants
may be given as functions of other experimental settings such as
temperature and pressure. The Arrhenius model is often used for
this. The chain rule can be used to obtain the derivatives with
respect to the Arrhenius parameters given the derivatives for the
rate constants. The important area of modelling pharmacokinetic
parameters, such as elimination rate constants, for entire
populations is addressed by NONMEM (Beal and Sheiner, 1984). Many
of the pharmacokinetic parameters of interest are functions of the
rate constants and driving functions, so the derivatives with respect
to these parameters can be obtained through the results of sections
2 and 3.
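The chain-rule step can be made concrete with a small Python sketch. It assumes the common Arrhenius form k = A exp(-E/(RT)), with pre-exponential factor A and activation energy E; that specific form, the single-rate-constant setting, and the names `arrhenius`, `arrhenius_partials` and `chain_rule` are assumptions made for illustration and are not taken from the paper.

```python
import math

R_GAS = 8.314  # gas constant in J/(mol K)


def arrhenius(A, E, T):
    """Rate constant k = A * exp(-E / (R T)) at absolute temperature T."""
    return A * math.exp(-E / (R_GAS * T))


def arrhenius_partials(A, E, T):
    """Partial derivatives (dk/dA, dk/dE) of the Arrhenius rate constant."""
    k = arrhenius(A, E, T)
    return k / A, -k / (R_GAS * T)


def chain_rule(df_dk, A, E, T):
    """Convert a derivative with respect to k into derivatives
    with respect to the Arrhenius parameters A and E."""
    dk_dA, dk_dE = arrhenius_partials(A, E, T)
    return df_dk * dk_dA, df_dk * dk_dE
```

Here `df_dk` stands for whatever derivative of the model function with respect to the rate constant is already available, for example from the methods of section 2.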

Another situation that occurs in chemical modelling is the
availability of measurements on more than one response. The
derivatives of the model functions from section 2 can be used in the
generalized Gauss-Newton algorithm (Bates and Watts, 1984;
Bates and Watts, 1985a) to minimize the Box-Draper estimation
criterion (Box and Draper, 1965), which takes into account
correlations between responses. Applications of multi-response
estimation for systems of linear differential equations are given in
Bates and Watts (1985b).


The approach of differentiating the differential equations to
obtain the "sensitivity functions", or derivatives with respect to
model parameters, has been used by Caracotsios and Stewart (1985)
in more general reactor modelling. Their methods apply to mixed
systems of differential and algebraic equations as well as to certain
types of partial differential equations.

Using transformations to deal with heteroscedasticity, as
described in section 4, is a powerful technique but it can result in
using too many parameters. Many of the data sets for which
compartment models are used consist of a dozen or fewer
observations. Even adding one parameter to account for
heteroscedasticity could result in "over-fitting" the data. It also
opens the possibility of masking deterministic inadequacies of the
model, using a 2-compartment model where a 3-compartment model is
appropriate, say, by changing the stochastic part, that is, altering λ.
The sensitivity of the deterministic model to the transformation for
homoscedasticity is considered in Wolf (1985).


COMPUTATIONAL EXPERIENCE WITH CONFIDENCE REGIONS AND
CONFIDENCE INTERVALS FOR NONLINEAR LEAST SQUARES

We present the results of a Monte Carlo study of several methods for constructing confidence regions
and confidence intervals about parameters estimated by nonlinear least squares. We compare the
estimates produced by the most commonly discussed methods, namely the lack-of-fit method, the likelihood
method, and three variants of the linearization method. The linearization method is computationally
inexpensive and produces easily understandable results, while the likelihood and lack-of-fit methods both
are much more expensive and more difficult to report. In our tests, both the lack-of-fit and likelihood
procedures perform very reliably, but all three linearization methods often produce gross underestimates
of confidence regions and sometimes produce significant underestimates of confidence intervals. Among
the three variants of the linearization method, the variant based solely on the Jacobian appears
preferable to the two variants that utilize the full Hessian, because it is cheaper to compute, and is always as
reliable as the other two variants and sometimes more reliable. Cases when the linearization method
confidence regions will be poor appear to be reliably predicted by the Bates and Watts parameter effects
curvature diagnostic.


1. Introduction

This paper presents the results of an empirical study
comparing several methods for constructing confidence
regions and confidence intervals about parameters
estimated by nonlinear least squares. The methods
compared are the lack-of-fit method, the likelihood method,
and three variants of the linearization method.

The need for confidence regions and intervals commonly
arises in data fitting applications, where a response
variable y_i, observed with unknown error e_i, is fit to m
fixed predictor variables x_i using a function f(x_i;θ) which
can be either linear or nonlinear in the p parameters θ.
The function f(x_i;θ) is linear in θ if it can be written

$$
f(x_i;\theta) = x_i \theta ;
$$

otherwise, it is nonlinear. The methods analysed in this
study are identical when f(x_i;θ) is linear in θ; otherwise
they are not.

When the error is additive, the response variable
can be modeled by

$$
y_i = f(x_i;\theta) + e_i, \qquad i = 1, \ldots, n,
$$

where θ denotes the true but unknown value of the
parameters. The least squares estimator of θ is the
parameter value, denoted θ̂, which minimizes the sum of
the squares of the residuals, where the residuals, r_i(θ), are
estimates of the random error,

$$
r_i(\theta) = y_i - f(x_i;\theta).
$$

Thus,

$$
\hat{\theta} = \arg\min_{\theta} S(\theta)
$$

where S(θ) is the residual sum of squares,

$$
S(\theta) = \sum_{i=1}^{n} r_i(\theta)^2 = R(\theta)^{T} R(\theta),
$$

with R(θ) denoting a column vector with i-th component
r_i(θ), and R(θ)^T denoting the transpose of R(θ).
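To make these definitions concrete, the following Python sketch computes the residuals and the residual sum of squares for an arbitrary model function and then picks the best value from a list of candidates. The exponential-decay model `decay`, the noise-free data, and the crude grid search `grid_argmin` are hypothetical stand-ins chosen for brevity; a real nonlinear least squares code would use an iterative method instead.

```python
import math


def residuals(f, theta, xs, ys):
    """r_i(theta) = y_i - f(x_i; theta) for each observation."""
    return [y - f(x, theta) for x, y in zip(xs, ys)]


def rss(f, theta, xs, ys):
    """Residual sum of squares S(theta)."""
    return sum(r * r for r in residuals(f, theta, xs, ys))


def grid_argmin(f, xs, ys, candidates):
    """Candidate value of theta with the smallest S(theta)."""
    return min(candidates, key=lambda theta: rss(f, theta, xs, ys))


def decay(x, theta):
    """Hypothetical one-parameter model f(x; theta) = exp(-theta * x)."""
    return math.exp(-theta * x)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [decay(x, 0.5) for x in xs]            # noise-free illustration
theta_hat = grid_argmin(decay, xs, ys, [i / 10 for i in range(1, 11)])
```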

In our study, we assume that the model is correct
and that the errors are normal, independent, identically
distributed random variables with zero mean and
variance σ², i.e., distributed as N(0, σ²I). Then, the least
squares estimator θ̂ is the maximum likelihood estimator
of the parameters θ of the n-variate normal density
function,

$$
f_{\theta}(Y) = (2\pi\sigma^{2})^{-n/2} \exp\!\left(-\frac{e^{T}e}{2\sigma^{2}}\right),
$$

where Y is a column vector with i-th component y_i, and e
is a column vector with i-th component e_i.

Nearly normally distributed errors are, in fact,
encountered quite frequently in practice. This is because
measurement errors are often the sum of a number of random
errors from unknown sources, and, by the central
limit theorem, the sum of these errors is approximately
normally distributed whatever the distribution of the
individual errors that make up the sum.

In practice, the estimated values of the parameters θ̂
will not equal the true values θ because of the random
errors, e_i, in the data. Since θ̂ is a random variable,
however, it may be possible to indicate with some specific
probability 1-α in what region about θ̂ we might
reasonably expect θ to be. Such regions are known as
100(1-α)% confidence regions. A joint confidence
region about all of the parameters is defined using a
function

$$
CR_{\alpha} : Y \to \text{a region in } \mathbb{R}^{p}
$$

which satisfies

$$
\Pr\{\theta \in CR_{\alpha}(Y)\} = 1 - \alpha .
$$

Similarly, a confidence interval about an individual
parameter θ_j is defined using a function

$$
CI_{j,\alpha} : Y \to \text{an interval in } \mathbb{R}
$$

which satisfies

$$
\Pr\{\theta_j \in CI_{j,\alpha}(Y)\} = 1 - \alpha .
$$

The above definitions state that, before the data are
sampled, the probability that the confidence regions and
confidence intervals to be constructed will contain the
true parameter values is 1-α. Thus, if we repeatedly
draw samples and construct confidence regions and
intervals about the least squares estimates for each sample,
then in the long run 100(1-α)% of these confidence
regions and intervals should contain the true values.
Procedures that, for all functions f(x_i;θ) and confidence
levels 1-α, are statistically guaranteed asymptotically to
contain the true value 100(1-α)% of the time are called
exact; all other procedures are called approximate.

Various methods have been proposed for calculating
confidence regions and intervals for parameter estimation
by nonlinear least squares. These include several variants
of the linearization method, as well as the likelihood and
lack-of-fit methods. [See, e.g., Bard (1974), Gallant (1976),
Draper and Smith (1981).] We review all these methods
briefly in Section 2. They all are equivalent, and exact,
for linear models. For nonlinear models, only the
lack-of-fit method for computing confidence regions is exact;
the other methods for computing confidence regions and
all the methods for computing confidence intervals are
approximate. The linearization regions and intervals
appear to be the most approximate for nonlinear models,
but they also are far less expensive to compute than the
likelihood or lack-of-fit regions and intervals, and are the
predominant methods implemented in production
software. Some nonlinear least squares packages,
including NL2SOL [Dennis, Gay, and Welsch (1981)], include
three variants of the linearization method, which differ
only in that the variance-covariance matrix of the
estimated parameters is approximated in three different
ways, namely

$$
V_A = s^{2} \bigl(J(\hat{\theta})^{T} J(\hat{\theta})\bigr)^{-1},
$$

$$
V_B = s^{2} H(\hat{\theta})^{-1},
$$

or

$$
V_C = s^{2} H(\hat{\theta})^{-1} \bigl(J(\hat{\theta})^{T} J(\hat{\theta})\bigr) H(\hat{\theta})^{-1},
$$

where s² = S(θ̂)/(n-p) is the estimated residual
variance; J(θ̂) is the Jacobian of f(x_i;θ), i = 1, ..., n, at θ̂; and
H(θ̂) is the Hessian of ½S(θ) at θ̂.

Sections 3-5 of this paper describe and analyze a
Monte Carlo study that compares all of these methods for
computing confidence regions and intervals on 20
nonlinear models. The study is used to empirically observe
how often the true parameter values are contained in the
confidence regions and confidence intervals constructed
using a given method. The actual percent of the
nominally 100(1-α)% confidence regions and intervals which
are found to contain the true value is known as the
observed coverage. The observed coverage will generally
depend on the method used to construct the confidence
regions and confidence intervals, on the nominal
confidence level, 1-α, on the degree of nonlinearity of
the function, f(x_i;θ), and to a small extent, on the
number of replications in the simulation. If the
experiment used to generate the data is repeated a large
number of times under the same conditions, and if CR_α
and CI_{j,α} are exact and the model is correct, then the
observed coverage will approach the nominal coverage.
When CR_α and CI_{j,α} are only approximate, the observed
coverage will not necessarily approach the nominal
coverage, although one would hope that the difference between
the observed and nominal coverage for a reasonable
approximate method would be small for most functions.
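The notion of observed coverage can be illustrated with a small simulation. The Python sketch below is not the study's design: it assumes a hypothetical one-parameter model f(x;θ) = exp(-θx), ten design points, a simple Gauss-Newton fit started at the true value, and a nominal 95% linearization interval built from the Jacobian alone. The constant 2.262 is the 97.5% point of Student's t with 9 degrees of freedom.

```python
import math
import random

T_CRIT_9 = 2.262  # 97.5% point of Student's t, 9 degrees of freedom


def model(x, theta):
    """Hypothetical one-parameter model f(x; theta) = exp(-theta * x)."""
    return math.exp(-theta * x)


def fit_gauss_newton(xs, ys, theta, iterations=20):
    """Least squares estimate of theta by plain Gauss-Newton steps."""
    for _ in range(iterations):
        jac = [-x * model(x, theta) for x in xs]
        res = [y - model(x, theta) for x, y in zip(xs, ys)]
        theta += sum(j * r for j, r in zip(jac, res)) / sum(j * j for j in jac)
    return theta


def linearization_interval(xs, ys, theta_hat):
    """Nominal 95% interval using s^2 (J^T J)^{-1} with n = 10, p = 1."""
    n, p = len(xs), 1
    s2 = sum((y - model(x, theta_hat)) ** 2 for x, y in zip(xs, ys)) / (n - p)
    jtj = sum((x * model(x, theta_hat)) ** 2 for x in xs)
    half = T_CRIT_9 * math.sqrt(s2 / jtj)
    return theta_hat - half, theta_hat + half


def observed_coverage(theta_true=0.3, sigma=0.01, reps=2000, seed=1):
    """Fraction of simulated intervals that contain the true parameter."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(1, 11)]
    hits = 0
    for _ in range(reps):
        ys = [model(x, theta_true) + rng.gauss(0.0, sigma) for x in xs]
        theta_hat = fit_gauss_newton(xs, ys, theta_true)
        lo, hi = linearization_interval(xs, ys, theta_hat)
        hits += lo <= theta_true <= hi
    return hits / reps
```

With such a small error variance the model is nearly linear over the region of interest, so the observed coverage should sit close to the nominal 95%; larger noise or a more strongly nonlinear model is where discrepancies of the kind studied in this paper would be expected to appear.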

No similar study of this magnitude appears to have
been reported previously. The properties of confidence
regions and confidence intervals computed using the
linearization, likelihood, and lack-of-fit methods have
been analyzed by several authors, including Jennrich
(1959), Beale (1960), Guttman and Meeter (1965), Gallant
(1976), Duncan (1978), and Bates and Watts (1980).
While the literature includes numerous warnings
regarding the possible inaccuracy of the approximate methods,
it contains little empirical data to illustrate the size of
the discrepancies between observed and nominal coverage
that might be expected. In those studies which do contain
empirical data on confidence regions and intervals, the
largest reported difference between the observed and
nominal coverage is only 9% for a 95% confidence region
computed using the linearization method, and is even
smaller for the likelihood method [Gallant (1976)]. In
many practical applications, potential differences of 9%
might not be cause for concern. Evidence of much larger
differences, however, would indicate the need for
improved methods. Our results provide such evidence.

Our Monte Carlo study has several purposes. First,
we wish to determine whether the observed coverage of
the linearization method is significantly affected by how
the variance-covariance matrix is computed. Second, we
wish to determine whether the approximate confidence
regions and confidence intervals constructed using the
linearization and likelihood methods, and the
approximate confidence intervals constructed using the lack-of-fit
method, have observed coverage significantly different
from nominal. In particular, we want to know whether
the frequently used linearization method is significantly
better or worse than the more expensive likelihood and
lack-of-fit methods. Section 3 describes how we designed
our study to answer these questions. The results are
presented and discussed in Section 4. We have also
investigated how effective the diagnostics of Bates and Watts
(1980) are in predicting when the confidence regions
produced by the linearization and likelihood methods should
be reliable; this part of the study is the subject of Section
5.

Our study is oriented toward nonlinear least squares
software developers who need assurance that the methods
they implement are reasonable for a wide variety of
problems. We make only the customary assumptions that the
model is correct and that the errors are normally
distributed. We do not assume that we can change the
representation of the parameters, e.g., by reparameterizing
θ as log(θ), in order to reduce the difference between
the observed and nominal coverage, because
reparameterization is not a technique that can be routinely
implemented by software developers who have no control over
the functions analyzed. Readers interested in using
reparameterization to improve their results are referred to
Ratkowsky (1983).

The conclusions we draw from this study are
presented in Section 6. The first conclusion is that among
the variants of the linearization method, the one using V_A
is the best choice because it is the cheapest, and is always
at least as reliable as the other two variants and
sometimes more reliable. The second conclusion is that even
the best linearization method can be very poor;
confidence regions with observed coverage as low as
12.1% for a nominal region, and confidence intervals
with observed coverage as low as 75.0% for a nominal
interval are reported. In contrast, for each of the
datasets tested, the confidence regions and confidence
intervals constructed using the likelihood method and
lack-of-fit methods are quite close to nominal. Finally,
our study indicates that the diagnostics of Bates and
Watts (1980) appear quite successful at predicting when
linearization confidence regions will be poor. Our
recommendations as to how nonlinear least squares software
should calculate confidence regions and intervals, in light
of these conclusions, also are given in Section 6.

2.  Background 

This section briefly discusses methods for constructing
confidence regions and confidence intervals. First, we
give a very quick survey of confidence regions and
confidence intervals for linear least squares. Next, we
describe the two different ways function nonlinearity can
affect the solution locus. Then, we review the
linearization, likelihood, and lack-of-fit methods for constructing
confidence regions and confidence intervals when the
model is nonlinear. For a more complete discussion, see
Bard (1974), Gallant (1976), Draper and Smith (1981), or
Donaldson (1985).

Linear  least  squares 

When f(x_i;θ) is linear in the parameters θ, then
f(x_i;θ) = x_i θ. Consequently, the Jacobian of F(θ) is X,
an n by p matrix with i-th row x_i. If we assume that X is
of full rank, then X^T X is nonsingular, and the linear least
squares estimators can be expressed in closed form by

$$
\hat{\theta} = (X^{T}X)^{-1} X^{T} Y .
$$
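For a straight-line model the closed-form estimator reduces to solving a 2 by 2 system, which the following Python sketch does explicitly. The choice of rows x_i = (1, x) and the function name `linear_least_squares` are illustrative assumptions; the code simply writes out (X^T X)^{-1} X^T Y for that case.

```python
def linear_least_squares(xs, ys):
    """Intercept and slope from the normal equations (X^T X) theta = X^T Y,
    where row i of X is (1, x_i)."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx          # determinant of X^T X
    if det == 0:
        raise ValueError("X is not of full rank")
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope
```

For example, `linear_least_squares([0, 1, 2, 3], [1, 3, 5, 7])` recovers an intercept of 1 and a slope of 2, since those points lie exactly on a line.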

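When e ~ N(0, σ²I), a 100(1-α)% confidence region
about θ̂ contains those values θ for which

$$
S(\theta) - S(\hat{\theta}) \le s^{2}\, p\, F_{p,\,n-p,\,1-\alpha}, \qquad (2.1)
$$

where F_{p,n-p,1-α} denotes the 100(1-α)% point of the F
distribution with p and n-p degrees of freedom.
Equation (2.1) is equivalent to

$$
(\theta - \hat{\theta})^{T} X^{T}X\, (\theta - \hat{\theta}) \le s^{2}\, p\, F_{p,\,n-p,\,1-\alpha} \qquad (2.2)
$$

for all linear models, which shows that the shape of the
confidence regions about θ̂ is ellipsoidal for all linear
models.

A 100(1-α)% confidence interval about θ̂_j contains
those values θ_j for which

$$
|\theta_j - \hat{\theta}_j| \le s\, \sqrt{(X^{T}X)^{-1}_{jj}}\; t_{n-p,\,1-\alpha/2}, \qquad (2.3)
$$

where (X^T X)^{-1}_{jj} is the (j,j)-th element of the inverse of
X^T X. The limits of this confidence interval can be shown
to be those values θ_j which

$$
\text{maximize } (\theta_j - \hat{\theta}_j)^{2} \text{ subject to }
S(\theta) - S(\hat{\theta}) \le s^{2} F_{1,\,n-p,\,1-\alpha} . \qquad (2.4)
$$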

Nonlinearity  and  the  Solution  Locus 

The solution locus, or estimation space, of
f(x_i;θ), i = 1, ..., n, consists of all points with coordinates
expressible as

$$
\bigl(f(x_1;\theta),\, f(x_2;\theta),\, \ldots,\, f(x_n;\theta)\bigr)
$$

where the x_i, i = 1, ..., n, are the fixed values of the
predictor variables, and θ is allowed to vary over all possible
values of the p unknown parameters. The solution locus
is planar if there exists a reparameterization of f(x_i;θ)
that makes the function linear in the p parameters.
Otherwise, the solution locus is curved.

A coordinate grid on the solution locus can be
formed by tracing the paths obtained when each
parameter is individually allowed to vary while all other
parameters are held fixed. The coordinate grid is curvilinear
whenever the function f(x_i;θ) is nonlinear in one or more
of its parameters. It is linear only when the function
itself is linear.

Curvature of the solution locus is called "intrinsic"
curvature [Beale (1960); Bates and Watts (1980)].
Curvature of the coordinate grid is called "parameter-effects"
or simply "parameter" curvature [Bates and
Watts (1980)]. Intrinsic curvature is not affected by
reparameterization. Parameter-effects curvature is.
Linear functions have zero parameter-effects curvature
and zero intrinsic curvature. Nonlinear functions always
have nonzero parameter-effects curvature, and can have
either zero or nonzero intrinsic curvature, i.e., a planar or
curved solution locus, respectively.

Nonlinear  Least  Squares 

When the function is nonlinear, the least squares
estimators of the parameters cannot in general be
expressed in closed form, and must instead be computed
by iterative techniques. Construction of exact confidence
regions and confidence intervals also is much more
difficult, and so approximate methods are frequently
used. The leading methods, linearization, likelihood, and
lack-of-fit, are described briefly below.

Linearization methods. Linearization methods
for constructing confidence regions and confidence
intervals assume that the nonlinear function can be
adequately approximated by an affine, or linear,
approximation to the function at the solution. That is, this method
assumes that the solution locus is planar, and that the
coordinate grid is linear throughout the area to be
covered by the confidence regions and confidence
intervals. Under this assumption, linear least squares theory
tells us that the confidence region about θ̂ consists of
those values θ for which

$$
(\theta - \hat{\theta})^{T} V^{-1} (\theta - \hat{\theta}) \le p\, F_{p,\,n-p,\,1-\alpha},
$$

while a confidence interval about θ̂_j, j = 1, ..., p, consists of
those values θ_j for which

$$
|\theta_j - \hat{\theta}_j| \le V_{jj}^{1/2}\; t_{n-p,\,1-\alpha/2},
$$

where V is the estimated variance-covariance matrix of
the parameters, and V_jj is the (j,j)-th element of V.

Three approximations to V are frequently used.
These are

$$
V_A = s^{2} \bigl(J(\hat{\theta})^{T} J(\hat{\theta})\bigr)^{-1}, \qquad (A)
$$

$$
V_B = s^{2} H(\hat{\theta})^{-1}, \qquad (B)
$$

and

$$
V_C = s^{2} H(\hat{\theta})^{-1} \bigl(J(\hat{\theta})^{T} J(\hat{\theta})\bigr) H(\hat{\theta})^{-1}, \qquad (C)
$$

where J(θ̂) is the Jacobian of F(θ) at θ̂; H(θ̂) is the
Hessian of ½S(θ) at θ̂; and s² is the residual variance,
s² = S(θ̂)/(n-p). Approximation (A) is the most
common approximation to V, and is the direct analog from
linear least squares theory. Approximation (B) can be
obtained using maximum likelihood theory, and can be
viewed as using observed rather than expected
information in forming the variance-covariance matrix.
Approximation (C) is obtained by using a quadratic model of
S(θ). [For a more detailed discussion of these variants,
see Bard (1974) or Donaldson (1985).] When certain
regularity conditions are met [Jennrich (1959)], these
approximations to V asymptotically will approach the true
variance-covariance matrix of the model. Note also that
these approximations differ only when

$$
\frac{\partial^{2} f(x_i;\theta)}{\partial\theta_j\, \partial\theta_k}
$$

is nonzero. In particular, for linear functions, each of
these representations of V is equal to

$$
s^{2} \bigl(J(\hat{\theta})^{T} J(\hat{\theta})\bigr)^{-1} = s^{2} (X^{T}X)^{-1} .
$$
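To see how the three approximations relate, the following Python sketch evaluates them for a model with a single parameter, where every matrix collapses to a scalar. It assumes H is the Hessian of one half of S(θ), so that H = J^T J minus the sum of residuals times second derivatives; this convention is an assumption chosen so that all three agree for a linear model, as the text states. The function name and argument layout are illustrative.

```python
def variance_approximations(first, second, resid, p=1):
    """Scalar versions of approximations (A), (B) and (C).

    first  : first derivatives of f(x_i; theta) at the estimate
    second : second derivatives of f(x_i; theta) at the estimate
    resid  : residuals y_i - f(x_i; theta) at the estimate
    """
    n = len(resid)
    s2 = sum(r * r for r in resid) / (n - p)       # residual variance
    jtj = sum(d * d for d in first)                # J^T J
    # Hessian of S/2 (assumed convention): J^T J - sum r_i * f_i''
    hess = jtj - sum(r * d2 for r, d2 in zip(resid, second))
    v_a = s2 / jtj
    v_b = s2 / hess
    v_c = s2 * jtj / hess ** 2
    return v_a, v_b, v_c
```

When every second derivative is zero the three values coincide, and otherwise they differ by an amount that grows with the size of the residuals and the second derivatives.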


Linearization methods have the advantage that their
resulting confidence regions and intervals are simple and
inexpensive to construct, and that they produce bounded,
convex confidence regions. In addition, the information
needed to construct confidence regions and intervals using
this method can be parsimoniously summarized by the p
by p matrix V, and is well understood by users familiar
with linear least squares. Because the linearization
methods assume that both the intrinsic curvature and the
parameter-effects curvature of f(x_i;θ) are zero, however,
we expect that the linearization methods could sometimes
produce observed coverages very far from the expected
nominal coverage. The results of our Monte Carlo study
show this to be true.


Likelihood method. The likelihood method is
another approximate method for producing confidence
regions and confidence intervals. The likelihood method
confidence region about θ̂ consists of those values θ for
which

$$
S(\theta) - S(\hat{\theta}) \le s^{2}\, p\, F_{p,\,n-p,\,1-\alpha} .
$$

This is analogous to equation (2.1) for confidence regions
for the parameters of a linear function, although when
f(x_i;θ) is nonlinear in the parameters the resulting
confidence region is no longer ellipsoidal. The likelihood
method confidence interval about θ̂_j is the interval
bounded by the points which

$$
\text{maximize } (\theta_j - \hat{\theta}_j)^{2} \text{ subject to }
S(\theta) - S(\hat{\theta}) \le s^{2} F_{1,\,n-p,\,1-\alpha} .
$$

This confidence interval is the projection onto the
appropriate parameter axis of the above region, and is
analogous to equation (2.4) for confidence intervals in the
case of linear least squares.

When the solution locus is planar, the confidence
regions (but not the confidence intervals) constructed
using the likelihood method are exact. In addition,
likelihood method confidence regions and intervals have the
desirable property that they are constructed from contours
of constant likelihood, and that the regions and
intervals are not affected by reparameterization of the
function $f(x_i;\theta)$. Thus we might expect the likelihood
method to produce confidence regions and confidence
intervals with observed coverage closer to nominal than
those produced using the linearization methods. However,
the likelihood method has several practical disadvantages.
Both the confidence regions and confidence
intervals produced using the likelihood method can be
disjoint and unbounded because the contours of a nonlinear
function can be disjoint and unbounded. The
method also is very expensive to use, and, when the data
arrays are large, it can be awkward to publish the
information necessary to reconstruct the confidence region
because this information is not succinctly summarized as
it is in the case of the linearization method.
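
The membership test behind a likelihood method confidence region can be
sketched in a few lines. This is a minimal illustration, not the code
used in the study. It assumes the region takes the usual form
S(theta) - S(theta_hat) <= s^2 p F with s^2 = S(theta_hat)/(n - p); the
function names, the model `f`, and the `f_quantile` argument (the F
distribution quantile, supplied by the caller rather than computed) are
all hypothetical.

```python
def sum_of_squares(f, xs, ys, theta):
    """Residual sum of squares S(theta) for a model f(x, theta)."""
    return sum((y - f(x, theta)) ** 2 for x, y in zip(xs, ys))


def in_likelihood_region(f, xs, ys, theta, theta_hat, f_quantile):
    """Return True if theta lies in the likelihood region about theta_hat.

    Assumed form: S(theta) - S(theta_hat) <= s^2 * p * F, where
    s^2 = S(theta_hat) / (n - p) and F (f_quantile) is the quantile of
    the F distribution with (p, n - p) degrees of freedom chosen by
    the caller.
    """
    n, p = len(xs), len(theta_hat)
    s_hat = sum_of_squares(f, xs, ys, theta_hat)
    s2 = s_hat / (n - p)
    return sum_of_squares(f, xs, ys, theta) - s_hat <= s2 * p * f_quantile


# Illustrative one-parameter model and made-up data.
model = lambda x, th: th[0] * x
xs = [1.0, 2.0, 3.0]
ys = [1.1, 1.9, 3.2]
theta_hat = [14.5 / 14.0]      # least squares estimate for this toy data
inside = in_likelihood_region(model, xs, ys, [1.0], theta_hat, 18.51)
outside = in_likelihood_region(model, xs, ys, [1.5], theta_hat, 18.51)
```

No explicit description of the region's boundary is ever formed: each
candidate value only requires one extra evaluation of the sum of
squares, which is why testing a single point is cheap even though
tracing the full region is not.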


Lack-of-fit method. The lack-of-fit method can be
used to produce exact joint confidence regions for all p of
the parameters, and to produce approximate confidence
intervals and confidence regions for subsets of the
parameters. An exact $100(1-\alpha)\%$ confidence region consists of
all values $\theta$ such that


where


$R(\theta)^T (I - P(\theta)) R(\theta)$


Note that the lack-of-fit method does not require that the
least squares solution be found prior to constructing the
confidence region. Similarly, a confidence interval for the
parameter $\theta_j$ consists of those values for which there
exist values of $\theta_k$, $k = 1, \ldots, j-1, j+1, \ldots, p$, such that for
these p parameter values,

where $\tilde{S}(\theta)$ is the residual sum of squares obtained
when $R(\theta)$ is linearly fit to all the columns of $J(\theta)$
excluding the j-th, and $\hat{S}(\theta)$ is the residual sum of
squares obtained when $R(\theta)$ is linearly fit to $J(\theta)$. This
interval is exact if $f(x_i;\theta)$ is linear in
$\theta_k$, $k = 1, \ldots, j-1, j+1, \ldots, p$; otherwise it is approximate.


The lack-of-fit method is even more expensive to use
than the likelihood method, and, as is the case for the
likelihood method, the information needed to construct
the confidence regions cannot be succinctly summarized
for publication. Also, the confidence regions and
confidence intervals constructed using the lack-of-fit
method are guaranteed to contain every minimum, maximum,
and/or saddle point of the likelihood surface. This
makes the lack-of-fit method structurally undesirable.


3.  The  Monte  Carlo  Study 


This  section  briefly  describes  how  our  Monte  Carlo 
study  was  constructed.  Full  details  are  provided  by 
Donaldson  (1985). 

The Monte Carlo method uses the computer to simulate
the results of repeating an experiment many times in
order to obtain a large sample from which the statistical
properties of a system can be examined. For each
simulation, we first generated the errors and response variables.
The errors, $\epsilon$, were produced using the Marsaglia and
Tsang pseudo-normal random number algorithm (1984) as
implemented by James Blue and David Kahaner of the
National Bureau of Standards Scientific Computing
Division. The response variable, Y, was then constructed so
that its i-th component is


$$ y_i = f(x_i;\theta^*) + \epsilon_i . $$


Then the least squares estimate, $\hat\theta$, was calculated using
NL2SOL, an unconstrained quasi-Newton code for nonlinear
least squares [Dennis, Gay, and Welsch (1981)].
Starting values for NL2SOL were set to the true values of
the parameters, $\theta^*$, and the stopping criteria for the
convergence tests based on the relative change in the
parameters and in the sum of squares both were set to 10’"*.

Finally, for each confidence region or interval
method and each derivative configuration being analyzed,
we recorded whether the true values of the parameters
were contained within the confidence regions and
confidence intervals for this realization of the data.
Determining whether the true parameter values lay
within the confidence regions and confidence intervals
about the least squares estimates fortunately did not
require that we construct the full confidence regions and
confidence intervals for each confidence level and method.
Instead, we simply calculated the smallest confidence
level, $1-w$, such that a $100(1-w)\%$ confidence region or
confidence interval constructed using the method being
analyzed will contain the true parameter values. When
$w < \alpha$, the true value did not lie in the $100(1-\alpha)\%$
confidence region or confidence interval; when $w \ge \alpha$, it
did.

The values $1-w$ were obtained using the hypothesis
tests corresponding to the formulas for confidence regions
and intervals given in Section 2, and the appropriate
cumulative distribution functions; the procedures are
described in detail in Donaldson (1985). The cumulative
distribution functions were obtained from the STARPAC
subprogram library [Donaldson and Tryon (1983)].
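
The bookkeeping described above can be sketched as follows. This is an
illustration under stated assumptions, not the study's code: the
function name and the sample values of w are hypothetical, and w is
taken to be defined so that 1 - w is the smallest confidence level whose
region contains the true parameters, which means a 100(1 - alpha)%
region contains them exactly when w >= alpha.

```python
def observed_coverage(w_values, alpha):
    """Percentage of realizations whose 100(1 - alpha)% region
    contains the true parameter values.

    Each w is assumed to satisfy: 1 - w is the smallest confidence
    level whose region contains the truth. The region at level
    1 - alpha therefore contains the truth exactly when w >= alpha.
    """
    hits = sum(1 for w in w_values if w >= alpha)
    return 100.0 * hits / len(w_values)


# Hypothetical w values from five realizations, for illustration only.
w = [0.62, 0.03, 0.41, 0.08, 0.97]
levels = [0.50, 0.75, 0.95, 0.99]
coverage = {level: observed_coverage(w, 1.0 - level) for level in levels}
```

Storing one number per realization is what allows the same simulated
data to be scored at every nominal confidence level without
constructing any region explicitly.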

The observed coverage, $\hat\gamma_\alpha$, for the particular nominal
confidence level, method and system under analysis is
the percentage of the total number of realizations of the
data, N, for which $w \ge \alpha$. When N is large, the standard
deviation of $\hat\gamma_\alpha$ can be approximated using the normal
approximation to the binomial distribution. In this study
we used N = 500, so the maximum standard deviation of
the observed coverage at any coverage level is approximately
2.2%.
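
The arithmetic behind the 2.2% figure can be checked directly. The
sketch below uses the binomial standard deviation sqrt(c(1 - c)/N) for
a coverage fraction c; the function name is hypothetical, and the value
at 0.95 is included only for comparison.

```python
import math


def coverage_std_percent(coverage_fraction, n_realizations):
    """Binomial standard deviation of an observed coverage, in percent."""
    variance = coverage_fraction * (1.0 - coverage_fraction) / n_realizations
    return 100.0 * math.sqrt(variance)


N = 500
max_sd = coverage_std_percent(0.5, N)     # largest possible value, about 2.24
sd_at_95 = coverage_std_percent(0.95, N)  # smaller at a true coverage of 0.95
half_width = 2.0 * max_sd                 # two standard deviations
```

Because c(1 - c) is largest at c = 0.5, using that value gives a bound
that holds at every coverage level, which is what makes a band built
from it conservative.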


Note that substituting a new realization of the data
for one which could not be completely analyzed because
either (a) the nonlinear least squares algorithm did not
converge, or (b) the test statistics could not be computed
for every method being analyzed, is a form of censoring
which will bias the observed coverages obtained. In our
analysis, we adjusted the value of $\sigma$ for each dataset so
that every realization could be completely analyzed, and
therefore the results reported in this paper are not
derived from censored data.

We computed the observed coverage for four nominal
confidence levels, 0.50, 0.75, 0.95, and 0.99. In this
paper  we  only  include  our  data  for  the  level  0.95, 
although  we  comment  briefly  in  Section  4  on  our  results 
at  the  other  levels.  Data  for  the  full  study  are  given  in 
Donaldson  (1985). 

The references for the datasets used in our Monte
Carlo study are given in Appendix A and described in
detail in Donaldson (1985). With only two exceptions,
the functions and data which comprise our datasets have
been taken from Ratkowsky (1983), Himmelblau (1970),
Guttman and Meeter (1965), and Duncan (1978). The
standard deviation of the errors of some of the datasets
has been adjusted in order to allow us to successfully
analyze each realization of the data for each dataset. The
two datasets not taken from the published literature are
identified as 8ACA and 9AAG. Dataset 8ACA was
created especially for this study by generalizing function
3 to a larger number of parameters. Dataset 9AAG
involves a microwave absorption line function taken from
a consulting session at the National Bureau of Standards
in Boulder, Colorado.

The number of parameters in the 20 datasets
analyzed ranges from 2 to 8 and the ratio of the number of
parameters to the number of observations ranges from
2/(2 to 3/5. While these datasets are often troublesome,
they  are  mostly  real  world  problems  that  have  not  been 
made  artificially  difficult. 

Each dataset was analyzed twice to allow us to
examine the effect of increasing the standard deviation of
the errors. In the first analysis, $\epsilon \sim N(0, \sigma^2 I)$; in the
second analysis, $\epsilon \sim N(0, (\eta\sigma)^2 I)$, where $\eta$ is
approximately the largest number $\le 10$ for which every
realization of the data could be successfully analyzed. The
methods analyzed in the second analysis were the same as
in the first except that variants B and C of the linearization
method were excluded from the second analysis
because, when $\eta > 1.0$, we were frequently unable to
compute the required test statistics using these two variants.

Computation of the linearization method and the
lack-of-fit method requires that certain derivatives be
available. The Jacobian of $F(\theta)$ is used by both the
linearization and lack-of-fit methods. Variants B and C of
the linearization method use the Hessian of $S(\theta)$ as well.
In practice, analytic derivatives often are not available.
Therefore, in our study each method was implemented
and analyzed using three different derivative
configurations. These configurations are (1) the Jacobian
and Hessian both approximated by finite differences, (2)
the Jacobian computed analytically and the Hessian
computed by finite differences, and (3) both the Jacobian and
the Hessian computed analytically. For derivative
configurations (1) and (2), the variance-covariance matrix
needed by the linearization method was returned directly
from NL2SOL. For configuration (3), it was constructed
outside of NL2SOL. For details on the formulas used to
compute the finite-difference derivative approximations,
see Donaldson (1985).

We ran our Monte Carlo study in single precision on
a 60 bit word length computer. All subroutines extracted
from other sources were used without modification except
for NL2SOL, which was changed for this study in two
important ways. First we disabled the two tests within
NL2SOL used to detect near singularity. Second, we used
the STARPAC front end to NL2SOL. With this front
end, the finite difference approximation to the Jacobian is
computed with the optimal derivative step sizes selected
using the algorithm developed by Schnabel (1981), thus
maximizing the number of correct digits in each element
of the finite difference Jacobian.
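
For readers unfamiliar with finite difference Jacobians, the following
sketch shows the basic forward-difference construction. It is not the
STARPAC front end and does not implement the step-size selection
algorithm cited above: it assumes a simple relative step equal to the
square root of machine epsilon, and the function name and the example
model are hypothetical.

```python
import math
import sys


def forward_difference_jacobian(f, xs, theta, rel_step=None):
    """Approximate J[i][j] = d f(x_i; theta) / d theta_j.

    Uses forward differences with a fixed relative step (an assumed
    textbook rule, not an optimally selected step size).
    """
    if rel_step is None:
        rel_step = math.sqrt(sys.float_info.epsilon)
    base = [f(x, theta) for x in xs]
    jac = [[0.0] * len(theta) for _ in xs]
    for j, t in enumerate(theta):
        shifted = list(theta)
        shifted[j] = t + rel_step * max(abs(t), 1.0)
        h = shifted[j] - t  # the step actually taken in floating point
        for i, x in enumerate(xs):
            jac[i][j] = (f(x, shifted) - base[i]) / h
    return jac


# Illustrative two-parameter model: f(x; theta) = theta0 * (1 - exp(-theta1 * x)).
def model(x, th):
    return th[0] * (1.0 - math.exp(-th[1] * x))


xs = [0.5, 1.0, 2.0]
theta = [2.0, 0.7]
jac_fd = forward_difference_jacobian(model, xs, theta)
jac_exact = [[1.0 - math.exp(-theta[1] * x),
              theta[0] * x * math.exp(-theta[1] * x)] for x in xs]
```

With a fixed step of this kind roughly half of the available digits are
correct in each element, which is the loss that a carefully chosen step
size is meant to minimize.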

4.  Results  and  Observations 

This section presents the results of our Monte Carlo
study of the lack-of-fit method, the likelihood method,
and the three variants of the linearization method. The
section is divided into a discussion of confidence regions
and confidence intervals. For each, we also make a
number of observations about the results. The conclusions
we draw from our analysis are discussed in the next
chapter.

The material in this chapter includes a number of
figures. These are printed at the end of the paper.

Confidence  Regions 

Results. The results for nominally 95% confidence
regions constructed using each of the methods analyzed in
this study with $\epsilon \sim N(0, \sigma^2 I)$ are graphically displayed in
Figure  1.  For  each  dataset,  the  observed  coverage  is  plot¬ 
ted  against  the  method  and  derivative  configuration  used 
to  obtain  it. 

The three derivative configurations are labeled DC1,
DC2, and DC3 in these and the following figures and
tables, as well as in Appendix B. Here DC1 denotes use of
finite difference approximations for both the Jacobian and
the Hessian, DC2 denotes use of analytic Jacobian and
finite difference Hessian, and DC3 denotes use of analytic
Jacobian and Hessian. Since the computations required
to calculate the lack-of-fit method results and the
likelihood method results using derivative configurations DC2
and DC3 are exactly the same, these results are displayed
together.

Figure 2 shows the analogous results for
$\epsilon \sim N(0, (\eta\sigma)^2 I)$. As noted in Section 3, variants B and
C of the linearization method are excluded from the
analysis displayed in Figure 2 because computational
difficulties were encountered for these variants when the
variance of the errors was increased.

A conservative 95% confidence interval about the
nominal confidence level is indicated on each plot by a
pair of horizontal lines which represent the values
$100(1-\alpha) \pm 4.4$, where 4.4 is two times the maximum
standard deviation of the observed coverage at any
coverage level. This confidence
interval provides a quick means of determining whether
any of the observed coverages for each method are
significantly different from the nominal confidence level
at the 5% level. When the method used to construct the


confidence regions and confidence intervals is exact, we
expect that the observed coverage for 95% of all possible
datasets will lie within this confidence interval.

Observations. Figures 1 and 2 show that the
lack-of-fit and likelihood method confidence regions are
quite reliable, and that the results are not affected by use
of finite difference derivatives. In all our tests, they
produced observed coverages which seldom vary from
nominal by an amount that is significant at the 5% level. In
fact, for these datasets, there is only one instance
(dataset 3AAA, $\epsilon \sim \ldots$) where the difference
between the nominal and observed coverages produced
using these two methods is greater than 5%, and in this
instance, the observed coverage is greater than nominal,
not less.

The  three  variants  of  the  linearization  method,  on 
the  other  hand,  frequently  produced  far  less  reliable 
confidence  regions,  although,  as  discussed  below,  the 
results  still  do  not  appear  to  be  affected  by  the  use  of 
finite-difference  derivatives.  The  difference  between  the 
nominal  and  observed  coverages  obtained  using  the 
linearization methods often is considerably more than
20%, which is a difference that many if not most users
would  find  unacceptable. 

By comparing Figure 1 to Figure 2, it is apparent
that increasing the variance of the errors does, in fact,
increase the differences between observed and nominal
coverage for all methods. Our tests at confidence levels
0.50, 0.75, and 0.99, which are not reported in detail here,
also showed that the spread between the observed and
nominal  coverage  obtained  using  the  linearization  method 
increases  as  the  nominal  confidence  level  is  increased. 

The large differences for some datasets between the
observed coverage of confidence regions constructed using
the likelihood method and those obtained using the
linearization method may be explained by the difference
in the shape of the two regions. The likelihood method
confidence region corresponds to the boundary and interior
of a contour of the sum of squares surface, i.e., a contour
of constant likelihood, whereas the linearization
method confidence regions are always ellipsoidal. We
plotted these contours for various datasets, and the
differences sometimes were very large. Examples for
datasets 3AAA and J1AAG are given in Donaldson
(1985).

Figure 1 also indicates that the observed coverages
obtained using variants A, B, and C of the linearization
method are nearly identical. The results of two-sided
paired-sample t-tests indicate that there are no
statistically significant differences at the 5% level between the
observed coverages obtained using any of the variants of
the linearization method with any of the derivative
configurations. The same results were obtained for our
tests at the 0.50, 0.75, and 0.99 confidence levels.

Confidence  Intervals 

Results. Figures 3 and 4 provide information for
confidence intervals which is analogous to that shown in
Figures 1 and 2 for confidence regions. The observed
coverages plotted are the smallest of the p confidence
interval coverages obtained for each dataset. Figure 3 displays
the observed confidence interval results for nominally
95% confidence levels, when $\epsilon \sim N(0, \sigma^2 I)$; Figure 4 shows
the results when $\epsilon \sim N(0, (\eta\sigma)^2 I)$, excluding
linearization method variants B and C as was done for
the linearization method confidence regions.

Observations. Figure 3 shows that for confidence
intervals, the best results are obtained using the lack-of-fit
and likelihood methods, and the worst results are
obtained using the linearization method, as was the case
for confidence regions. The lack-of-fit and likelihood
methods produce confidence intervals which seldom vary
from nominal by an amount that is significant at the 5%
level, and never are less than nominal by more than
5.0%. Again, use of finite difference Jacobians does not
appear to affect the results for these two methods.

The three variants of the linearization method, on
the other hand, frequently produce far less reliable
confidence intervals than the lack-of-fit and likelihood
methods. Disturbing differences between observed and
nominal coverages occur when each of the variants of the
linearization method is used to construct confidence
intervals. The observed coverage for a nominally 95%
confidence interval is as low as 75.0%, 'l'1.0%, and 10.8%
for variants A, B, and C, respectively. For most of the
datasets tested in our study, however, the span between
observed and nominal coverage produced by the three
variants of the linearization method is considerably less
for confidence intervals than for linearization method
confidence regions constructed about the parameters of
the same dataset. This is especially true when derivative
configurations DC2 and DC3 are used.

One reason why linearization method confidence
intervals have better coverage than linearization method
confidence regions is that, when the parameter estimates
are correlated with each other, a number of points may be
included in the linearization method confidence intervals
but not in the confidence regions. Note, however, that if
a confidence interval was computed for the linear
combination of the parameters given by the eigenvector
corresponding to the minor axis of the linearization
method confidence region ellipsoid, then the linearization
method confidence interval observed coverage should
approximately equal that of the linearization method
confidence region. In our Monte Carlo study, we actually
computed the linearization method confidence interval
observed coverage for this linear combination of the
parameters. In every case, the observed coverage we
obtained for the confidence interval about this linear
combination was approximately equal to that of the
linearization method confidence region observed coverage.

The use of finite differences to approximate both the
Jacobian and the Hessian appears to significantly degrade
the confidence interval results for linearization variants B
and C. Figure 3 shows that, while there is no striking
difference in the results obtained using the three variants
of the linearization method with derivative configurations
DC2 and DC3, variants B and C degrade significantly
more than variant A when using DC1, i.e., finite
difference Jacobian and Hessian. A two-sided paired-sample
t-test was used to determine whether, for a given
derivative configuration, the observed coverages obtained
using the different linearization method variants are
statistically different at the 5% significance level. The
results indicate that when derivative configurations DC2
and DC3 are used, the differences in the results obtained
using variants A, B, and C are seldom statistically
significant at the 5% level, but that when the Jacobian
and Hessian are approximated using finite differences
(derivative configuration DC1) then the differences in
results are often significant.

Comparing Figures 3 and 4 shows that as the variance
of the errors is increased, the differences between
observed and nominal coverage also are increased, as was
the case for the confidence region results. However, this
increase is not as pronounced for confidence intervals as
for confidence regions. The results at confidence levels
0.50, 0.75, 0.95, and 0.99 also showed that as the nominal
confidence level approaches 100%, the spread between
observed and nominal coverages obtained using the
linearization method is increased.

5.  Diagnostic  tools 

The preceding section demonstrates a pressing need
for diagnostics to warn users when the commonly used
linearization method confidence region will not have
adequate coverage. In addition, it would be useful to have a
warning to indicate when the approximate likelihood
method may be inadequate. Bates and Watts (1980) have
proposed measures of nonlinearity that provide such
diagnostics.

According to Bates and Watts, when their relative
measure of parameter effects curvature is small compared
to the critical value $(F_{p,\,n-p,\,1-\alpha})^{-1/2}$, then the linear
coordinate grid assumption is valid over the region of interest,
and therefore the linearization method confidence region
should be adequate. Similarly, when their relative
measure of intrinsic curvature is small compared to the same
critical value, then the assumption that the solution locus
is planar is valid over the region of interest and therefore
the likelihood method confidence region should be
adequate.

In Figure 5 we plot the 20 confidence region observed
coverages obtained using linearization method variant A
with analytic derivatives (derivative configuration DC3)
and $\epsilon \sim N(0, (\eta\sigma)^2 I)$ against the Bates and Watts relative
measure of parameter effects curvature. Likewise, in
Figure 6 we plot the corresponding 20 likelihood method
confidence region observed coverages against the Bates
and Watts relative measure of intrinsic curvature. The
relative curvature measures were computed at the true
parameter values using the true variance of the errors. In
these plots, we have scaled the measures of parameter
effects curvature and intrinsic curvature by dividing the
measure by the appropriate critical value. Thus, in both
of these plots, a scaled curvature measure less than 1
indicates the relative measure was less than the critical
value, while a value greater than 1 indicates the
curvature exceeded the critical value.

It is clear from Figure 5 that the Bates and Watts
parameter effects curvature measure is strongly correlated
with the observed coverage obtained using the linearization
method. In fact, for our data as the parameter
effects curvature increases, the observed coverage for the
linearization method confidence regions decreases nearly
monotonically and linearly as the logarithm of the scaled
parameter effects curvature. Furthermore, in all datasets
where the parameter effects curvature is less than the
critical value, the observed confidence region coverage is
very close to nominal, while in all cases where the parameter
effects curvature is greater than ten times the critical value, the
observed coverage is unsatisfactorily low. Datasets with
parameter effects curvature between one and ten times
the critical value had observed confidence region coverage
between 83.2% and 91.6%. From these results, it appears
that the Bates and Watts parameter effects curvature is a
reliable, if perhaps stringent, indicator of when the
linearization method will produce reliable confidence
regions.
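
The way the scaled curvature is read as a diagnostic can be summarized
in a short sketch. The thresholds of one and ten times the critical
value are the ones reported above for the 20 datasets of this study;
the function name and the numeric inputs are hypothetical, and the
curvature measure and critical value are assumed to have been computed
elsewhere.

```python
def linearization_diagnosis(param_effects_curvature, critical_value):
    """Scale the curvature by its critical value and summarize what
    was observed in this study for that range of scaled curvature."""
    scaled = param_effects_curvature / critical_value
    if scaled < 1.0:
        verdict = "coverage close to nominal in this study"
    elif scaled <= 10.0:
        verdict = "intermediate: coverage between 83.2% and 91.6% in this study"
    else:
        verdict = "coverage unsatisfactorily low in this study"
    return scaled, verdict


# Hypothetical curvature values paired with a hypothetical critical value.
low = linearization_diagnosis(0.2, 0.5)
middle = linearization_diagnosis(1.0, 0.5)
high = linearization_diagnosis(10.0, 0.5)
```

The bands describe what was seen empirically here and are not a
guarantee for other models or error variances.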

Figure 6 shows that all but one of the 20 datasets
tested in this study have intrinsic curvature which is less
than the critical value, which means that each of these
datasets is nearly planar. For nearly planar datasets we
expected good observed coverage from the likelihood
method, and, as Figure 6 shows, that is what we got. Since
none of our datasets have high intrinsic curvature, however,
we do not know how the likelihood method will
perform when the solution locus is not nearly planar. We
cannot assume that the accurate results obtained in our
study using the likelihood method will necessarily carry
over to datasets with large intrinsic curvature.

Cook, Tsai and Wei (1981) provide an example
which has scaled parameter effects curvature of 931.5 and
scaled intrinsic curvature of 8.1. Both the parameter
effects curvature and intrinsic curvature of this dataset
exceed any curvature measure we observed in the 20
datasets in our study. For this dataset, we computed
observed confidence region coverages of 19.0% and 95.0%
using the linearization method and likelihood methods,
respectively. While the linearization method confidence
region observed coverage is very far from nominal as we
would expect based on the parameter effects curvature of
this model, the likelihood method confidence region
observed coverage is not. We cannot conclude anything
from this one observation. It is clear, however, that
additional analysis of datasets with high intrinsic curvature
would be useful to further assess the effect of a
non-planar solution locus on the likelihood method.

6.  Conclusions

Based on our computational study, we can draw
conclusions about: i) the comparison between the three
variants of the linearization method; ii) the reliability of
linearization methods for calculating confidence regions
and confidence intervals; and iii) the reliability of the
likelihood and lack-of-fit methods for calculating
confidence regions and confidence intervals.

When using the linearization method to construct
confidence regions and intervals, our Monte Carlo study
has shown no clearcut difference in the observed coverage
of one variant as compared to another. In our tests, the
only statistically significant difference among the results
produced by the three linearization variants was in
constructing confidence intervals with finite difference
Jacobians and Hessians; here variant A was superior to
variants B and C. We found no empirical evidence that one
should prefer variants B or C, even though they may be
appealing from a theoretical point of view. Therefore we
conclude that variant A of the linearization method,
which is computed using

$$ V_A = s^2 \left( J(\hat\theta)^T J(\hat\theta) \right)^{-1} , \qquad (6.1) $$

is the best variant to use for constructing both confidence
regions and confidence intervals, because it is simpler,
less expensive, and more numerically stable to compute
than variants B or C, which use

$$ V_B = s^2 H(\hat\theta)^{-1} \qquad (6.2) $$

and

$$ V_C = s^2 H(\hat\theta)^{-1} \left( J(\hat\theta)^T J(\hat\theta) \right) H(\hat\theta)^{-1} , \qquad (6.3) $$

respectively. Variant A is simpler and less expensive
because it only requires the Jacobian of the model
function at the solution and not the additional second order
terms that are also required to form the Hessian. It is
more stable because it can be formed by inverting the
upper triangular factor R of the QR factorization of the
Jacobian rather than by calculating the inverse of the
Hessian; the former calculation can be expected to lose
roughly half as many digits as the latter in finite precision
arithmetic.
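
The QR route to the variant A matrix can be sketched as follows. This
is an illustration only, not the NL2SOL or STARPAC implementation: it
assumes the Jacobian has full column rank, uses modified Gram-Schmidt
in place of a production QR routine, and all function names and the
example Jacobian are hypothetical. It relies on the identity
(J^T J)^{-1} = R^{-1} R^{-T} when J = QR.

```python
import math


def thin_qr_r(jac):
    """Upper-triangular R from modified Gram-Schmidt on the columns of jac."""
    n, p = len(jac), len(jac[0])
    cols = [[jac[i][j] for i in range(n)] for j in range(p)]
    r = [[0.0] * p for _ in range(p)]
    for j in range(p):
        r[j][j] = math.sqrt(sum(v * v for v in cols[j]))
        q = [v / r[j][j] for v in cols[j]]
        for k in range(j + 1, p):
            r[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - r[j][k] * a for a, b in zip(q, cols[k])]
    return r


def invert_upper(r):
    """Inverse of an upper-triangular matrix by back substitution."""
    p = len(r)
    inv = [[0.0] * p for _ in range(p)]
    for c in range(p):
        for i in range(c, -1, -1):
            rhs = 1.0 if i == c else 0.0
            rhs -= sum(r[i][k] * inv[k][c] for k in range(i + 1, c + 1))
            inv[i][c] = rhs / r[i][i]
    return inv


def variant_a_covariance(jac, s2):
    """s2 * (J^T J)^{-1}, formed as s2 * R^{-1} R^{-T} without building J^T J."""
    rinv = invert_upper(thin_qr_r(jac))
    p = len(rinv)
    return [[s2 * sum(rinv[i][k] * rinv[j][k] for k in range(p))
             for j in range(p)] for i in range(p)]


# Small made-up Jacobian (3 observations, 2 parameters) with s2 = 1.
v_a = variant_a_covariance([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]], 1.0)
```

Working with R avoids forming J^T J, whose condition number is the
square of that of J; this is the usual explanation for the smaller loss
of digits.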

The linearization method is not always an adequate
method for approximating confidence regions and
confidence intervals for the parameters of a nonlinear
model, however. The results presented in the preceding
section show just how poor the linearization method can
be in some cases. Although there are many examples
where the linearization method's observed coverage
differs from nominal by only a very small amount, there
are also many cases where the observed coverage is far
lower than the nominal. In our tests, the best linearization
method variant, A, produced observed coverages as
low as 12.1% for nominal 95% confidence regions and
75.0% for nominal 95% confidence intervals.

Users  will  continue  to  use  the  linearization  method, 
however,  because  it  is  readily  available  in  software  pack¬ 
ages  and  provides  a  concise  representation  of  the 
information  needed  to  construct  confidence  regions  and 
intervals.  'I'he  erratic  results  obtained  in  our  study  when 
using  the  linearization  method  lead  us  to  conclude  that 
users  of  nonlinear  least  squares  software  must  be  helped 
to  cautiously  assess  the  results  they  obtain  using  the 
linearization  method.  The  results  of  the  preceding  section 
show that the diagnostic tools proposed by Bates and
Watts (1980) are very successful in indicating cases where
the linearization method confidence regions are likely to
be unreliable. In these cases, more reliable methods, such
as the likelihood or lack-of-fit methods, are required to
produce accurate confidence regions or intervals.

Our study shows that the lack-of-fit and likelihood
methods both produce observed coverages acceptably
close to nominal in every test case. Although the
difficulties and expense associated with using these two
methods to compute confidence regions make it unlikely
that  they  will  ever  routinely  replace  the  commonly  used 
linearization  method  for  this  purpose,  they  appear  to  be  a 
reliable  alternative  that  should  be  considered  when  diag¬ 
nostics  show  that  linearization  confidence  regions  are 
unreliable. It is not as difficult and expensive to construct
confidence intervals using the lack-of-fit or likelihood
methods, and we believe that producers of nonlinear
least squares software should consider this possibility.
(Constructing these intervals requires the solution of a
series of nonlinearly constrained optimization problems;
it may be necessary to construct special purpose software
to solve these problems as efficiently as possible.) Performing
hypothesis tests using the likelihood or lack-of-fit
methods is computationally simple for both confidence
regions and intervals, so we recommend that one of these
two  methods  be  employed  for  hypothesis  tests  whenever 
possible. 


Users may prefer the likelihood method to the lack-
of-fit method even though it is approximate and the
lack-of-fit method is exact, because the likelihood method
has more desirable structural characteristics than the
lack-of-fit method. Our study provides no empirical evidence
that the results produced by the likelihood method
are inferior to those produced by the lack-of-fit method.
This  does  not  guarantee  that  similar  results  will  be 
obtained on other datasets, however. In particular, the
results of the diagnostic test proposed by Bates and
Watts showed that all our datasets have low intrinsic curvature,
which is precisely the situation when likelihood
methods  are  expected  to  be  very  reliable.  The  additional 
dataset  we  analysed  with  high  intrinsic  curvature  also 
produced  likelihood  method  confidence  region  observed 
coverage close to nominal. Additional analysis is required
to determine whether the likelihood method is reliable for
datasets with high intrinsic curvature, and to determine
whether the Bates and Watts measure of intrinsic curvature
is a useful tool for indicating when the likelihood
method confidence regions are likely to be unreliable.

In  addition  to  diagnostics,  it  appears  that  there  is  a 
need  for  new  methods  for  estimating  confidence  regions 
that are both reliable and easy to report. We are especially
interested in investigating two methods that would
result in conservative elliptical confidence regions. The
first method is to find the minimal magnification of the
linearization confidence region that encloses the
likelihood or lack-of-fit confidence region. This
would require the solution of a constrained optimization
problem with one nonlinear equality constraint. The
second method is to find the smallest volume ellipse that
encloses the desired likelihood or lack-of-fit confidence
region. This would require the solution of a semi-infinite
programming problem, i.e. an optimization problem with
an infinite set of constraints.

7. Summary

We  have  presented  the  results  of  a  Monte  Carlo 
study  comparing  the  linearization,  likelihood  and  lack- 
of-fit methods for constructing confidence regions and
confidence intervals. Our results indicate that the linearization
method should be constructed using the simplest
approximation to the variance-covariance matrix, (6.1),
as it is simpler, less expensive, more numerically stable,
and at least as accurate as the other two linearization
variants, which are constructed using (6.2) and (6.3). We
have  also  given  considerable  evidence  that  confidence 
regions,  and  to  some  extent  confidence  intervals,  con¬ 
structed  using  the  linearization  method  can  be  essentially 
meaningless. 

Our study shows that the likelihood and lack-of-fit
methods, on the other hand, produced consistently good
results for the datasets tested. However, because the
likelihood method is approximate it is not clear that the
good results we obtained with it will necessarily be
characteristic of all datasets. Also, because of the
undesirable structural characteristics of the lack-of-fit
method, it is unlikely to be used routinely, although in
cases where accuracy is of extreme importance, it may be
a useful tool to have.


Because of the uncertainty associated with the
linearization and likelihood methods, we also have briefly
examined how the Bates and Watts curvature measures
relate to the confidence region observed coverages we
obtained in this study. Our results show that the Bates
and Watts parameter effects curvature appears to provide
excellent indication of when the linearization method may
produce less than satisfactory results. Our results are not
as conclusive, however, about the relation between intrinsic
curvature and likelihood method coverage since the
solution loci for all of our datasets were nearly planar.
Appendix

Dataset  Id.     p/n    Reference                    Model and page

   1     2AAA    2/12   Guttman and Meeter (1965)    model T| _. , page 628
   2     3AAA    2/12   Guttman and Meeter (1965)    model 1^3 , page 628
   3     4AAA    2/24   Duncan (1978)                model III, page 127
   4     5AAF    4/18   Himmelblau (1970)            model 6.2-3, page 183
   5     6AAA    3/13   Himmelblau (1970)            model 6.2-4, page 188
   6     8ACA    4/24   None
   7     9AAG    8/25   Inghold Hertel               Microwave Absorption Line Function
                        (personal communication)
   8     11AAB   4/9    Ratkowsky (1983)             model 4.4, page 62
   9     12AAB   4/9    Ratkowsky (1983)             model 4.14, page 77
  10     14ACG   3/10   Ratkowsky (1983)             model 3.5, page 51 and 58
  11     14ABO   3/21   Ratkowsky (1983)             model 3.5, page 51 and 58
  12     14AAG   3/42   Ratkowsky (1983)             model 3.5, page 51 and 58
  13     15AAA   3/16   Ratkowsky (1983)             model 6.11, page 120 and 58
  14     16AAF   5/27   Ratkowsky (1983)             model 6.12, page 122, 123 and 125
  15     17AAA   2/12   Ratkowsky (1983)             model 3.4, page 50 and 58
  16     18AAA   3/9    Ratkowsky (1983)             model 4.1, page 61 and 88
  17     19AAA   3/9    Ratkowsky (1983)             model 4.2, page 61 and 88
  18             4/9    Ratkowsky (1983)             model 4.3, page 62 and 88
  19     21AAA   4/9    Ratkowsky (1983)             model 4.5, page 63 and 88
  20     22AAR   3/5    Ratkowsky (1983)             model 5.1, page 93 and 102
CURVATURES FOR PARAMETER SUBSETS IN NONLINEAR REGRESSION


R.  Dennis  Cook,  University  of  Minnesota;  Miriam  L.  Goldberg,  University  of  Wisconsin 


The relative curvature measures of nonlinearity proposed by Bates and Watts (1980)
are extended to an arbitrary subset of the parameters in a normal, nonlinear
regression model. In particular, the subset curvatures proposed indicate the validity
of linearization-based approximate confidence intervals for single parameters. The
derivation produces the original Bates-Watts measures directly from the likelihood
function. When the intrinsic curvature is negligible, the parameter-effects curvature
array contains all information necessary to construct curvature measures for
parameter subsets.

Key  Words:  Confidence  regions.  Curvature  measures.  Least  squares.  Likelihood. 


1. INTRODUCTION

Confidence regions for parameters of a
normal nonlinear regression model are
commonly constructed by using linear
regression methods, replacing the solution
locus with the tangent plane at the maximum
likelihood estimate. Such tangent plane
regions are generally easier to construct than
corresponding likelihood regions. More
importantly, the elliptical contours of
tangent plane regions are relatively easy to
characterize and understand, particularly for
one- or two-dimensional parameter subsets
which are often of interest. Likelihood
regions, on the other hand, are not influenced
by parameter-effects nonlinearity and,
therefore, generally have true coverage closer
to the nominal level than do tangent plane
regions. Under suitable regularity conditions
and with a sufficiently large sample size,
tangent plane and likelihood regions will be
in good agreement, but in any particular
problem the strength of this agreement is
usually uncertain.

Bates and Watts (1980) propose measures of
intrinsic and parameter-effects curvature for
assessing the adequacy of the tangent plane
approximation: Relatively small values for
both the maximum intrinsic curvature $\Gamma^{\eta}$ and
the maximum parameter-effects curvature $\Gamma^{\tau}$
indicate that the tangent plane approximation
is reasonable, while relatively large values
for either $\Gamma^{\eta}$ or $\Gamma^{\tau}$ indicate that this
approximation is questionable. These ideas are
extended and refined by Bates and Watts
(1981), and Hamilton, Bates and Watts (1982).
For a review of related literature, see Bates
and Watts (1980) and Ratkowsky (1983).
Programs for calculating $\Gamma^{\eta}$ and $\Gamma^{\tau}$ are given
by Bates, Hamilton and Watts (1983).

The material in Bates and Watts (1980)
represents an important step forward, but
their method for assessing the adequacy of the
tangent plane approximation applies only to
tangent plane regions for the full parameter
vector. This method is not appropriate for
assessing the adequacy of tangent plane
regions for a subset of parameters, as
indicated by Cook and Witmer (1984) and
Linssen (1980). It is fairly easy to
construct examples where $\Gamma^{\tau}$ is relatively
large and yet there is good agreement between
the tangent plane and likelihood regions for a
subset of the parameters. One such example is
given in Section 2, which is a brief review of
the tangent plane approximation and the Bates-
Watts methodology. We are often interested in
confidence regions for subsets, particularly
for individual parameters. Thus, the
inability of the Bates-Watts methodology to
assess the adequacy of subset regions reflects
an important gap in our understanding and
ability to deal with nonlinear models.
In Section 3 we develop measures for
assessing the agreement between tangent plane
and likelihood regions for an arbitrary subset
of parameters from a nonlinear regression
model. The measures require the same building
blocks as needed for the construction of $\Gamma^{\tau}$,
and reduce to $\Gamma^{\tau}$ when the full parameter
vector is considered. Computationally, these
measures require little more effort than $\Gamma^{\tau}$
itself. Section 4 contains several examples
and our concluding comments are given in
Section 5. In the remainder of this section,
we establish notation and briefly review
relevant background information.

A nonlinear regression model can be
represented in the form

$y_i = f(x_i, \theta) + \varepsilon_i$ ,  $i = 1, \ldots, n$  (1)

where $y_i$ is the i-th response, $x_i$ is a vector
of known variables, $\theta$ is a $p \times 1$ vector of
unknown parameters, the response function f is
a known, scalar-valued function that is twice
continuously differentiable in $\theta$, and the
errors $\varepsilon_i$ are independent and identically
distributed normal random variables with mean
0 and variance $\sigma^2$.

The maximum likelihood (ML) estimator $\hat\theta$ of
$\theta$ can be obtained by minimizing the residual
sum of squares

$RSS(\theta) = \sum_{i=1}^{n} (y_i - f(x_i, \theta))^2$  (2)

Kennedy and Gentle (1980) discuss methods for
obtaining $\hat\theta$. For our purposes we assume that
$\hat\theta$ is available.

For notational convenience, let
$f_i(\theta) = f(x_i, \theta)$ and let V denote the $n \times p$
matrix with elements $f_i^{r} = \partial f_i / \partial\theta_r$, $i = 1, \ldots, n$,
$r = 1, \ldots, p$.
Here and in what follows all derivatives are
evaluated at $\hat\theta$ unless explicitly indicated
otherwise.

Various quadratic approximations to be
used in the following sections involve the $p \times p$
matrices $W_i$, $i = 1, \ldots, n$, with elements
$f_i^{rs} = \partial^2 f_i / \partial\theta_r \partial\theta_s$, $r, s = 1, \ldots, p$. These
matrices can be written conveniently in an
$n \times p \times p$ array W (Bates and Watts, 1980). The
ab-th "column" of W is the ab-th second
derivative vector $W_{ab}$ with elements $f_i^{ab}$,
$i = 1, \ldots, n$, while the i-th face of W is the
$p \times p$ matrix $W_i$ consisting of the i-th elements of
the second derivative vectors $W_{ab}$.

2. CURVATURES AND THE TANGENT PLANE
APPROXIMATION

Let $F(\theta)$ denote the $n \times 1$ vector with
elements $f_i(\theta)$. The standard elliptical
confidence region for $\theta$ based on replacing
$F(\theta)$ with the tangent plane at $\hat\theta$ can be
written as

$\{\theta : \phi^T V^T V \phi \le p s^2 F_\alpha(p, n-p)\}$  (3)

where $\phi = (\phi_a) = \theta - \hat\theta$, $s^2 = RSS(\hat\theta)/(n-p)$,
and $F_\alpha(\nu_1, \nu_2)$ is the upper $\alpha$ probability
point of an F distribution with $\nu_1$ and $\nu_2$
degrees of freedom.

To assess the adequacy of the region in
(3), we need the standard quadratic expansion
of F about $\hat\theta$:

$F(\theta) \approx F(\hat\theta) + V\phi + \frac{1}{2} \phi^T W \phi$  (4)

Multiplication involving three-dimensional
arrays is defined as in Bates and Watts (1980)
so that $\phi^T W \phi$ is an $n \times 1$ vector with elements
$\phi^T W_i \phi$, $i = 1, \ldots, n$. Generally, if F is
quadratic over a sufficiently large
neighborhood of $\hat\theta$ and the quadratic term of
(4) is sufficiently small relative to the
linear term, the tangent plane region (3)
should be reasonable; otherwise, this
approximation may be in doubt. Bates and
Watts (1980, 1981) implement this idea by
first decomposing each column of W into its
projections onto the column and null spaces of
V:

$W_{ab} = P_V W_{ab} + (I - P_V) W_{ab} = W_{ab}^{\tau} + W_{ab}^{\eta}$ ,

where $P_V$ is the orthogonal projection operator
for the column space of V. With this
decomposition, the quadratic expansion (4)
becomes

$F(\theta) \approx F(\hat\theta) + V\phi + \frac{1}{2} \phi^T W^{\tau} \phi + \frac{1}{2} \phi^T W^{\eta} \phi$  (5)

where $W^{\tau}$ and $W^{\eta}$ are the $n \times p \times p$ arrays whose
columns are $W_{ab}^{\tau}$ and $W_{ab}^{\eta}$, respectively.

Next, the adequacy of the tangent plane
region is assessed by using the maximum
parameter-effects curvature

$\Gamma^{\tau} = \max \frac{\|\phi^T W^{\tau} \phi\|}{\|V\phi\|^2} \sqrt{p}\, s$  (6)

and the maximum intrinsic curvature

$\Gamma^{\eta} = \max \frac{\|\phi^T W^{\eta} \phi\|}{\|V\phi\|^2} \sqrt{p}\, s$  (7)

where the maximum is taken over all $\phi$ in $R^p$.
These curvatures, as well as the decomposition
of $\phi^T W \phi$ displayed in (5), reflect different
characteristics of the nonlinearity of the
model. The intrinsic curvature $\Gamma^{\eta}$ is
invariant under reparameterizations and is
thus a measure of the intrinsic nonlinearity
of the solution locus. In contrast, $\Gamma^{\tau}$
depends on the parameterization: different
parameterizations can result in substantially
different values of $\Gamma^{\tau}$. If both $\Gamma^{\tau}$ and $\Gamma^{\eta}$ are
sufficiently small, the tangent plane region
(3) should be adequate.
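
The definitions (6) and (7) can be evaluated numerically when p = 2 by scanning directions on the unit circle, since both ratios are unchanged when $\phi$ is rescaled. The sketch below is an illustration under stated assumptions, not the Bates, Hamilton and Watts program: the function name `max_curvatures`, the grid search, and the restriction to two parameters are all choices made for the example.

```python
import numpy as np


def max_curvatures(V, W, s, n_grid=20001):
    """Grid-search sketch of the maximum curvatures in (6) and (7) for p = 2.

    V : (n, 2) first-derivative matrix evaluated at the estimate
    W : (n, 2, 2) second-derivative array, W[i] being the i-th face
    s : estimate (or known value) of the error standard deviation
    Returns (parameter-effects maximum, intrinsic maximum).
    """
    n, p = V.shape
    if p != 2:
        raise ValueError("this sketch scans directions on a circle, so p must be 2")
    U, _ = np.linalg.qr(V)                   # columns of U span C(V)
    P = U @ U.T                              # orthogonal projector onto C(V)
    W_tau = np.einsum("ij,jab->iab", P, W)   # tangential part of every column
    W_eta = W - W_tau                        # part orthogonal to C(V)
    t = np.linspace(0.0, np.pi, n_grid)      # phi and -phi give the same ratio
    phi = np.stack([np.cos(t), np.sin(t)])   # (2, n_grid) unit directions
    denom = np.sum((V @ phi) ** 2, axis=0)   # ||V phi||^2 for each direction

    def scaled_max(arr):
        # quad[:, k] is the n-vector phi^T W phi for the k-th direction
        quad = np.einsum("ak,iab,bk->ik", phi, arr, phi)
        return s * np.sqrt(p) * np.max(np.linalg.norm(quad, axis=0) / denom)

    return scaled_max(W_tau), scaled_max(W_eta)
```

For more than two parameters the scan over a circle would have to be replaced by a numerical maximization over the unit sphere.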

More specifically, for a tangent plane
region of the form (3), Bates and Watts (1980)
suggest that the linear approximation should
be adequate if $\Gamma^{\eta}$ and $\Gamma^{\tau}$ are both small
compared to the guide $c = 1/\sqrt{F_\alpha(p, n-p)}$.
When $\Gamma^{\eta}$ or $\Gamma^{\tau}$ is greater than c, the linear
approximation and the circular approximation
that is the basis of the curvature measures
both break down within the tangent plane
region. Thus, Ratkowsky (1983) proposes that
c/2 be used as a cutoff level, beyond which
the tangent plane region is presumed
inadequate.

To demonstrate that the Bates-Watts
methodology can fail for subsets of $\theta$, we
consider the Fieller-Creasy problem in which
the ratio of the means of two normal
populations is of interest. The corresponding
nonlinear model can be written as

$f(x_i, \theta) = \theta_1 x_i + \theta_1 \theta_2 (1 - x_i)$  (8)

where $x_i$ is an indicator variable that takes
the values 1 and 0 for populations 1 and 2,
respectively. For convenience we assume equal
sample sizes for the two populations, $n_1 = n_2 = n/2$,
and, without loss of generality, we assume
that $\sigma^2$ is known.

The model given in (8) is intrinsically
linear so that $\Gamma^{\eta} = 0$. Further, Cook and
Witmer (1984) show that

$\Gamma^{\tau} = 2\sigma \{ |\hat\theta_2| + (1 + \hat\theta_2^2)^{1/2} \} / (\sqrt{n}\, |\hat\theta_1|)$  (9)

In this case the Bates-Watts (1980) guide for
judging the adequacy of the tangent plane
approximation is $c = \{\chi^2(\alpha; 2)\}^{-1/2}$ where $\chi^2(\alpha; \nu)$
is the upper $\alpha$ probability point of the chi-
squared distribution with $\nu$ degrees of
freedom. However, it is clear that standard
methods can be used to form exact confidence
intervals for $\theta_1$, the mean of the first
population, regardless of the value of $\Gamma^{\tau}$. In
other words, the tangent plane and likelihood
regions for $\theta_1$ are identical for all $\Gamma^{\tau}$.

A similar phenomenon occurs in connection
with $\theta_2$. Let $r = 2\sigma^2 \chi^2(\alpha; 1) / n\hat\theta_1^2$. Assuming that
r < 1, Cook and Witmer (1984) show that the $1-\alpha$
likelihood region for $\theta_2$ can be written as

$\{ \hat\theta_2 \pm ( r(1-r) + r\hat\theta_2^2 )^{1/2} \} / (1 - r)$  (10)

The level associated with this region is
exact. The corresponding tangent plane region
is

$\hat\theta_2 \pm ( r + r\hat\theta_2^2 )^{1/2}$  (11)

Clearly, (10) and (11) will be close only if r
is sufficiently small. For any fixed value of
r, however, $\Gamma^{\tau}$ may be large or small depending
on the value of $\hat\theta_2$ so that again the Bates-
Watts criterion fails to reflect accurately
the agreement between the tangent plane and
likelihood regions for a parameter subset. We
will return to this example at the end of the
next section.
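
The comparison between (10) and (11) is easy to explore numerically. The following sketch simply evaluates the two displayed expressions; the function name `fieller_creasy_intervals` and its argument names are hypothetical, introduced only for this illustration.

```python
import math


def fieller_creasy_intervals(theta2_hat, r):
    """Return (likelihood, tangent_plane) intervals for theta_2 from (10) and (11).

    theta2_hat : estimated ratio of the two means
    r          : the quantity r defined in the text; (10) needs 0 <= r < 1
    """
    if not 0.0 <= r < 1.0:
        raise ValueError("(10) describes an interval only when 0 <= r < 1")
    # Likelihood region (10): endpoints are rescaled by 1 / (1 - r).
    half_lik = math.sqrt(r * (1.0 - r) + r * theta2_hat ** 2)
    likelihood = ((theta2_hat - half_lik) / (1.0 - r),
                  (theta2_hat + half_lik) / (1.0 - r))
    # Tangent plane region (11): symmetric about the estimate.
    half_tan = math.sqrt(r + r * theta2_hat ** 2)
    tangent = (theta2_hat - half_tan, theta2_hat + half_tan)
    return likelihood, tangent
```

Calling the function over a range of r values shows the two intervals nearly coinciding for small r and the likelihood interval widening much faster as r approaches 1.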


3. SUBSETS

Let $L(\theta, \sigma^2)$ denote the log likelihood for
model (1), and partition
$\theta^T = (\theta_1^T, \theta_2^T)$ where $\theta_i$ is a $p_i \times 1$ vector,
i = 1, 2. The standard likelihood region for $\theta_2$
can be written in the form (Cox and Hinkley,
1974, p. 343)

$\{ \theta_2 : 2[ L(\hat\theta, \hat\sigma^2) - L(g(\theta_2), \theta_2, \hat\sigma^2(\theta_2)) ] \le \rho \}$  (12)

where $\rho$, a selected positive constant, is used
to set the nominal level and $(g^T(\theta_2), \hat\sigma^2(\theta_2))$
represents the vector-valued function that
maximizes $L(\theta_1, \theta_2, \sigma^2)$ for each value of $\theta_2$.
Evaluating (12), the likelihood region for $\theta_2$ can be
written equivalently as

$\{ \theta_2 : n \log[ \sum_{i=1}^{n} ( y_i - f_i(g(\theta_2), \theta_2) )^2 / n\hat\sigma^2 ] \le \rho \}$  (13)
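
The step from (12) to (13) follows from profiling out the variance in the normal log likelihood. The short derivation below is a sketch of that step, assuming the usual form of the log likelihood for model (1) and writing $\hat\sigma^2 = RSS(\hat\theta)/n$:

```latex
\begin{align*}
L(\theta,\sigma^2) &= -\tfrac{n}{2}\log(2\pi\sigma^2) - \frac{RSS(\theta)}{2\sigma^2},\\
\hat\sigma^2(\theta) = RSS(\theta)/n
  \quad&\Longrightarrow\quad
  \max_{\sigma^2} L(\theta,\sigma^2)
  = -\tfrac{n}{2}\log\{2\pi\,RSS(\theta)/n\} - \tfrac{n}{2},\\
2\bigl[L(\hat\theta,\hat\sigma^2) - L(g(\theta_2),\theta_2,\hat\sigma^2(\theta_2))\bigr]
  &= n\log\frac{RSS(g(\theta_2),\theta_2)}{RSS(\hat\theta)}
   = n\log\Bigl[\sum_{i=1}^{n}\bigl(y_i - f_i(g(\theta_2),\theta_2)\bigr)^2 \big/ n\hat\sigma^2\Bigr].
\end{align*}
```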


Clearly, the form of this region is governed
by the vector-valued function $h(\theta_2) = F(g(\theta_2), \theta_2)$.
If h is essentially linear over a
sufficiently large neighborhood of $\hat\theta_2$, the
contours of (13) will be elliptical and we can
expect (13) and the corresponding tangent
plane region to agree; otherwise these regions
will tend to be dissimilar. To determine when
these regions are in substantial agreement, we
investigate the behavior of h by using the
method described in Section 2, except that F
is replaced by h which, in combination with
$Y = (y_i)$, contains essential information on
$\theta_2$. Thus, in exact analogy with the Bates-
Watts development, we will produce expressions
for the curvature of the solution locus
submanifold defined by h. Where necessary for
clarity, we refer to this as "subset
curvature". Similarly, "subset parameter-
effects" and "subset intrinsic" refer to the
decomposition of the subset curvature into
components in the submanifold tangent plane
and its orthogonal complement.

Let $a^T(\theta_2) = (a_i(\theta_2)) = (g^T(\theta_2), \theta_2^T)$, let $A_1$
denote the $p \times p_2$ matrix with elements
$\partial a_i / \partial\theta_{2j}$, $i = 1, 2, \ldots, p$, $j = 1, 2, \ldots, p_2$, and let
$A_2$ denote the $p \times p_2 \times p_2$ array with i-th face
$A_{2i}$, $i = 1, 2, \ldots, p$: the elements of $A_{2i}$ are
$\partial^2 a_i / \partial\theta_{2j} \partial\theta_{2k}$, $j, k = 1, \ldots, p_2$. We assume, of
course, that g is a twice continuously
differentiable function of $\theta_2$. With these
definitions the straightforward quadratic
approximation of $h(\theta_2)$ about $\hat\theta_2$ can be written

$h(\theta_2) \approx F(\hat\theta) + V A_1 \phi_2 + \frac{1}{2} \phi_2^T A_1^T W A_1 \phi_2 + \frac{1}{2} V (\phi_2^T A_2 \phi_2)$  (14)

where $\phi_2 = \theta_2 - \hat\theta_2$.

3.1 Refining Equation (14)

For the quadratic expansion in (14) to be
useful, we need to develop explicit forms for
$A_1$ and $A_2$ and to produce a reexpression of (14)
that displays the (subset) parameter-effects
and intrinsic components of h at $\hat\theta_2$. To avoid
interruption, the details of this development
have been relegated to the Appendix. Here we
discuss the final form.


The final form of (14) is based on the
assumption that the intrinsic curvature of F
at $\hat\theta$ is negligible. That assumption is
somewhat restrictive but it is valid in the
important class of problems where the
parameters of interest are nonlinear functions
of the location parameters in a linear model.
In any event, we judge the practical
advantages of allowing for substantial
intrinsic curvatures to be minimal since
experience has shown (see Bates and Watts
1980, and Ratkowsky 1983) that they are
typically small. Of course, $\Gamma^{\eta}$ can and should
be evaluated in practice so that this
assumption can be checked.

In the remainder of this paper we use $C(M)$
and $C^{\perp}(M)$ to indicate the column and null
spaces, respectively, of the matrix M; the
corresponding orthogonal projection operators
will be denoted by $P_M$ and $P_M^{\perp}$, respectively.

In their development of the intrinsic and
parameter-effects curvatures for the full
parameter vector, Bates and Watts (1980) found
it convenient and revealing to work in
transformed coordinates. Similarly, the
quadratic expansion (14) is most easily
understood in terms of these same transformed
coordinates: Let V = UR denote the unique QR-
factorization of V where R is upper triangular
and the columns of the $n \times p$ matrix U form an
orthonormal basis for C(V). Next, partition R
as

$R = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$

where $R_{ii}$ is $p_i \times p_i$, i = 1, 2. Transformed
coordinates can now be defined as
$\tilde\phi^T = (\tilde\phi_1^T, \tilde\phi_2^T) = \phi^T R^T$ so that

$\tilde\phi_1 = R_{11} \phi_1 + R_{12} \phi_2$ ,  $\tilde\phi_2 = R_{22} \phi_2$  (17)

In the following any quantity with a tilde
added above indicates evaluation in the $\tilde\phi$
coordinates. Thus, for example,
$\tilde V = U$ and $\tilde W = R^{-T} W R^{-1}$. Partition the i-
th face $\tilde W_i = R^{-T} W_i R^{-1}$ of $\tilde W$ as

$\tilde W_i = \begin{pmatrix} \tilde W_{i11} & \tilde W_{i12} \\ \tilde W_{i21} & \tilde W_{i22} \end{pmatrix}$ ,  $i = 1, \ldots, n$  (18)

where the dimension of $\tilde W_{ijk}$ is $p_j \times p_k$.
Next, define $\tilde W_{22}$ to be the $n \times p_2 \times p_2$ subarray of
$\tilde W$ with i-th face $\tilde W_{i22}$ and similarly define $\tilde W_{12}$
to be the $n \times p_1 \times p_2$ subarray of $\tilde W$ with i-th face
$\tilde W_{i12}$. Finally, partition
$V = (V_1, V_2)$ and $U = (U_1, U_2)$ where $U_i$ and $V_i$
are $n \times p_i$ matrices.

With this structure, the quadratic
expansion of h can be reexpressed
informatively as

$h(\theta_2) \approx F(\hat\theta) + U_2 \tilde\phi_2$  (19a)
$\qquad + \frac{1}{2} \tilde\phi_2^T [P_{U_2}][\tilde W_{22}] \tilde\phi_2$  (19b)
$\qquad - U_1 [\tilde\phi_2^T U_2^T][\tilde W_{12}] \tilde\phi_2$  (19c)  (19)

where the brackets [·][·] indicate column
(sample space) multiplication as defined in
Bates and Watts (1980), and discussed briefly
in the Appendix. Term (19a) describes the
plane tangent to h at $\hat\theta_2$. Since $C(U_2) = C(P_{V_1}^{\perp} V_2)$,
this plane is simply the affine subspace
$F(\hat\theta) + C(P_{V_1}^{\perp} V_2)$. This is the same as the
subspace obtained when using the tangent plane
approximation to form a confidence region for
$\theta_2$. In other words, the confidence contours
based on the tangent plane approximation will
coincide with those based on substituting the
linear approximation of h into (13), as
expected.

Term (19b) contains the projections of the
columns of $\tilde W_{22}$ onto the plane tangent to h at
$\hat\theta_2$. Thus, this term reflects the (subset)
parameter-effects curvature of h in the
direction $\tilde\phi_2$. The maximum parameter-effects
curvature $\Gamma_s^{\tau}$ for the subset $\theta_2$ can now be
defined as

$\Gamma_s^{\tau}(\theta_2) = \max \| d^T [P_{U_2}][\tilde W_{22}] d \| \sqrt{p_2}\, s$  (20)

where the maximum is taken over all d in
$D = \{ d : d \in R^{p_2}, \|d\| = 1 \}$. Since $\tilde\phi_2$ is a
linear transformation of $\phi_2$ as described in
(17), $\Gamma_s^{\tau}(\theta_2)$ will be the same in both
coordinate systems.

To further understand (20), partition the
i-th face $\tilde A_i$ of the $p \times p \times p$ unscaled parameter-
effects curvature array $\tilde A = [U^T][\tilde W]$ as

$\tilde A_i = \begin{pmatrix} \tilde A_{i11} & \tilde A_{i12} \\ \tilde A_{i21} & \tilde A_{i22} \end{pmatrix}$  (21)

where the dimension of $\tilde A_{ijk}$ is $p_j \times p_k$,
$j, k = 1, 2$, $i = 1, \ldots, p$. Next, let $\tilde A_{22}$ denote the
$p_2 \times p_2 \times p_2$ subarray of $\tilde A$ with faces $\tilde A_{i22}$,
$i = p_1 + 1, \ldots, p$. Then
$[P_{U_2}][\tilde W_{22}] = [U_2][\tilde A_{22}]$ and

$\Gamma_s^{\tau}(\theta_2) = \max \| d^T \tilde A_{22} d \| \sqrt{p_2}\, s$  (22)

In this form it is clear that the maximum
parameter-effects curvature for the subset
problem depends only on the behavior of the
parameter-curves. The elements of $\tilde A_{22}$ can be
used to understand the behavior of these
parameter-curves in terms of arcing,
"compansion", fanning and torsion, as
described in Bates and Watts (1981).

Term (19c) is clearly in $C(V_1)$ and is thus
orthogonal to the subspace tangent plane.
This term then reflects the intrinsic
curvature of h at $\hat\theta_2$ so that the maximum
intrinsic curvature can be defined as

$\Gamma_s^{\eta}(\theta_2) = \max \| [d^T U_2^T][\tilde W_{12}] d \| \, 2 \sqrt{p_2}\, s$  (23)

Note that (23) contains the extra factor 2,
corresponding to the absence of the factor 1/2
in (19c).

This curvature can also be expressed in
terms of a subarray of $\tilde A$. Let $\tilde A_{12}$ denote the
$p_2 \times p_1 \times p_2$ subarray of $\tilde A$ that has faces $\tilde A_{i12}$,
$i = p_1 + 1, \ldots, p$. Then $\tilde A_{12} = [U_2^T][\tilde W_{12}]$ and

$\Gamma_s^{\eta}(\theta_2) = \max \| [d^T][\tilde A_{12}] d \| \, 2 \sqrt{p_2}\, s
 = \max \| \sum_{j=p_1+1}^{p} d_j \tilde A_{j12} d \| \, 2 \sqrt{p_2}\, s$  (24)

where $d_j$ is the $(j - p_1)$-th element of d.
Interestingly, the intrinsic curvature for the
subset problem depends only on fanning and
torsion components of $\tilde A$; compansion and arcing
play no role in the determination of $\Gamma_s^{\eta}$. The
fanning and torsion terms of $\tilde A$ depend in part
on how the columns of V are ordered. Since we
have assumed that the last $p_2$ columns of V
correspond to $\theta_2$, it is the fanning and
torsion with respect to this ordering that are
important.

If both $\Gamma_s^{\tau}$ and $\Gamma_s^{\eta}$ are sufficiently small,
the likelihood and tangent plane confidence
regions for $\theta_2$ will be similar; otherwise we
can expect these regions to be dissimilar.
Following Bates and Watts (1980),
$c = \{ F_\alpha(p_2, n-p) \}^{-1/2}$ can be used as a rough guide
for judging the size of these curvatures. As
noted earlier, our experience indicates that
curvatures must be substantially less than c
to insure close agreement between tangent
plane and likelihood regions. This will be
illustrated in Sections 3.3 and 4.

Finally, we combine the intrinsic and
parameter-effects components of (19) to define
the total curvature $\Gamma_s(\theta_2)$ of h at $\hat\theta_2$ as

$\Gamma_s(\theta_2) = \max \sqrt{p_2}\, s \, [ \| d^T \tilde A_{22} d \|^2 + 4 \| [d^T][\tilde A_{12}] d \|^2 ]^{1/2}$  (25)

As will be demonstrated in the next
subsection, the total subset curvature $\Gamma_s$ may
be more relevant than both $\Gamma_s^{\tau}$ and $\Gamma_s^{\eta}$. For
example, it is possible to have $\Gamma_s^{\tau} < c$ and
$\Gamma_s^{\eta} < c$ while $\Gamma_s > c$. In such situations $\Gamma_s^{\tau}$
and $\Gamma_s^{\eta}$ may incorrectly indicate that the
tangent plane approximation is adequate, while
$\Gamma_s$ correctly indicates otherwise.

When the full parameter $\theta$ is of interest,
we have $\theta_2 = \theta$ and $p_2 = p$. In this case, the
subset intrinsic curvature (24) is zero, $\tilde A_{22}$
is the Bates-Watts parameter-effects array,
and both (22) and (25) represent the maximum
parameter-effects curvature for $\theta$. Thus, our
derivation based on the likelihood reproduces
the primary quantity developed by Bates and
Watts (1980).

The main conclusion of this section is
that the unscaled parameter-effects curvature
array $\tilde A$ for the full parameter contains all
necessary information for evaluating the
adequacy of tangent plane confidence regions
for certain subsets of $\theta$. For example, if the
last parameter $\theta_p$ is of interest then $\Gamma_s^{\tau}(\theta_p)$
is simply $s |\tilde a_{ppp}|$ where $\tilde a_{ijk}$ is the (j,k)-th
element of the i-th face of $\tilde A$. Similarly,

$\Gamma_s^{\eta}(\theta_p) = 2 s ( \sum_{j=1}^{p-1} \tilde a_{pjp}^2 )^{1/2}$  (26)

Thus, compansion and fanning are the only
effects that are relevant to an assessment of
the agreement between likelihood and tangent
plane confidence regions for a single
parameter.
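
The single-parameter formulas above can be evaluated directly once V and W are available. The sketch below is an illustration, not the authors' program: the function names `parameter_effects_array` and `last_parameter_curvatures` are hypothetical, NumPy is assumed, and the total curvature is taken from (25) with $p_2 = 1$.

```python
import numpy as np


def parameter_effects_array(V, W):
    """Unscaled parameter-effects array for V (n, p) and W (n, p, p).

    Face k of the result has (a, b) element u_k^T Wt_ab, where
    Wt[i] = R^{-T} W[i] R^{-1} and V = U R is the QR factorization.
    """
    U, R = np.linalg.qr(V)
    R_inv = np.linalg.inv(R)
    # Faces in the transformed coordinates: R^{-T} W_i R^{-1}.
    W_t = np.einsum("ra,irs,sb->iab", R_inv, W, R_inv)
    # Column multiplication by U^T.
    return np.einsum("ik,iab->kab", U, W_t)


def last_parameter_curvatures(A, s):
    """Subset curvatures when only the last parameter is of interest (p_2 = 1)."""
    p = A.shape[0]
    face = A[p - 1]                                   # p-th face of the array
    gamma_tau = s * abs(face[p - 1, p - 1])           # s |a_ppp|
    gamma_eta = 2.0 * s * np.sqrt(np.sum(face[: p - 1, p - 1] ** 2))  # (26)
    gamma_total = np.hypot(gamma_tau, gamma_eta)      # (25) with p_2 = 1
    return gamma_tau, gamma_eta, gamma_total
```

The sign ambiguity in a computed QR factorization does not matter here, because only absolute values and squares of the array elements enter the three quantities.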


and  consequently  9p  is  the  only  single 
parameter  for  which  curvatures  can  be 
constructed  from  a  given  pararaeter‘-effect3 
array  A.  The  A-array  for  other  orderings  can 
be  constructed  by  permuting  the  columns  of  V 
and  beginning  again,  of  course. 

Alternatively,  a  computationally  more 
efficient  method  for  obtaining  the  Anarray  in 
a  rotated  coordinate  system  can  be  constructed 
as  follow.3.  Let  -  Z4i  where  Z  is  a  selected 
pxp  permutation  matrix.  In  what  follows,  the 
subscript  z  added  to  any  quantity  indicate 

evaluation  in  the  coordinates  (f  .  Clearly, 

T  T  *T  ^ 

-  VZ  -  URZ  .  Let  U  be  an  orthogonal 

(f  XT 

matrix  such  that  R  -  U  HZ  is  upper 
triangular.  Since  the  QR-factorizatlon  of  V 

z 

is  unique,  it  follows  that  V  -UR  where 
»  »  z  z  z 

U  -  UU  and  R  -  R  .  Using  this  structure  it 
z  z 

is  not  difficult  to  verify  that 

r  'ir  »  *Ti 

A^  -  [U  llu  AU  ‘]  (27) 

Thus,  to  find  A^,  the  parameter-ef fects 

curvature  array  for  the  rotated  coordinates 

> 

♦  ,  we  need  only  the  pxp  matrix  U  to 
*  T 

diagonalize  RZ  .  A  single  call  to  LINPACH 

,(1979)  routine  SCHEX  produces  R  ,  [u  ^](a]  and 

» 

the  information  necessary  to  construct  U  . 

3.3 Fieller-Creasy Again

To apply $\Gamma_s^{\tau}$ and $\Gamma_s^{\eta}$ in the Fieller-Creasy
problem when $\theta_2$ is the subset of interest, we
require only the 2×2×2 parameter-effects
curvature array A for


3.2  Computation 

Recall  that  the  developments  of  this 
section  are  based  on  the  assumption  that  the 
last  columns  of  V  correspond  to  the 
parameters  of  interest.  This  assumption  is 
necessary  to  maintain  the  collective  identity 
of  as  indicated  in  (17).  This  implies  that 
the  ordering  of  the  columns  of  V  is  critical 


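$$V = (x + \theta_2(b - x),\ \theta_1(b - x))$$

where x is the n×1 vector with elements $x_i$ as
defined following (8) and b is an n×1 vector
of ones. The faces $A_i$ of A are (Cook and
Witmer, 1984)

$$A_1 = \frac{\sqrt{2}\,\theta_2}{\theta_1 [n(1+\theta_2^2)]^{1/2}} \begin{pmatrix} 0 & 1 \\ 1 & -2\theta_2 \end{pmatrix} \tag{28}$$

and

$$A_2 = \frac{\sqrt{2}}{\theta_1 [n(1+\theta_2^2)]^{1/2}} \begin{pmatrix} 0 & 1 \\ 1 & -2\theta_2 \end{pmatrix} \tag{29}$$

Reading directly from this array we have

$$\Gamma_s^{\tau}(\theta_2) = \sigma |a_{222}| = \frac{2\sqrt{2}\,\sigma |\theta_2|}{|\theta_1| [n(1+\theta_2^2)]^{1/2}} \tag{30}$$

and

$$\Gamma_s^{\eta}(\theta_2) = 2\sigma |a_{212}| = \frac{2\sqrt{2}\,\sigma}{|\theta_1| [n(1+\theta_2^2)]^{1/2}} \tag{31}$$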
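
A numerical cross-check of the array for this example can be sketched as follows. Everything specific here is an assumption made for illustration: x is taken to be the indicator of the first of two equal-sized groups, the array is formed by the usual Bates and Watts (1980) construction $A = [U^T][R^{-T} W R^{-1}]$ from the QR factors of V, the two single-parameter curvatures are taken as $\sigma|a_{222}|$ and $2\sigma|a_{212}|$, and they are combined as a root sum of squares. The function names are hypothetical.

```python
import numpy as np

def fieller_creasy_array(theta1, theta2, n):
    """Parameter-effects array for f = theta1*x + theta1*theta2*(1 - x).

    Assumes a balanced design: x = 1 for the first n/2 cases, 0 otherwise.
    Returns A with A[i, j, k] = (j,k)-th element of the i-th face.
    """
    m = n // 2
    x = np.concatenate([np.ones(m), np.zeros(n - m)])
    b = np.ones(n)
    V = np.column_stack([x + theta2 * (b - x), theta1 * (b - x)])
    W = np.zeros((n, 2, 2))                 # second derivatives of f
    W[:, 0, 1] = W[:, 1, 0] = b - x
    U, R = np.linalg.qr(V)
    sign = np.sign(np.diag(R))              # make diag(R) positive
    U, R = U * sign, (R.T * sign).T
    Rinv = np.linalg.inv(R)
    C = np.einsum('ja,tjk,kb->tab', Rinv, W, Rinv)   # R^{-T} W_t R^{-1}
    return np.einsum('ti,tab->iab', U, C)            # [U^T][C]

def subset_curvatures(A, sigma):
    """Curvatures for theta_2 (last parameter), combined as a Euclidean norm."""
    tau = sigma * abs(A[1, 1, 1])
    eta = 2.0 * sigma * abs(A[1, 0, 1])
    return tau, eta, float(np.hypot(tau, eta))
```

Under these assumptions the combined value does not depend on $\theta_2$; with $\theta_1 = 3$ and $n = 2\sigma^2$ it equals 2/3.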


Recall that we are assuming $\sigma$ to be known in
this example so that the guide for assessing
the magnitudes of $\Gamma_s^{\tau}$ and $\Gamma_s^{\eta}$ is
$c = (\chi^2(\alpha;1))^{-1/2}$.
From (30) we see that $\Gamma_s^{\tau}(\theta_2)$ will be zero
only if $\theta_2 = 0$; in this case
$\Gamma_s^{\eta}(\theta_2) < c$ or, equivalently,
$r = 2\sigma^2\chi^2(\alpha;1)/n\theta_1^2 < 1/4$ is necessary for the
subset intrinsic curvature to be less than the
guide. Further, $r < 1/4$ is a sufficient --
although not necessary -- condition for both
$\Gamma_s^{\tau}(\theta_2)$ and $\Gamma_s^{\eta}(\theta_2)$ to be less than c when $\theta_2$ is
arbitrary.

Next, using (25) it follows that the total
subset curvature is simply

$$\Gamma_s(\theta_2) = \frac{2\sqrt{2}\,\sigma}{|\theta_1|\, n^{1/2}} \tag{32}$$

and thus $\Gamma_s(\theta_2) < c$ if and only if $r < 1/4$.

When r > 1 the likelihood region for $\theta_2$ will
be either the complement of an interval or
else the entire real line; otherwise this
region will be the interval given in (10). In
this example, the total subset curvature
recovers the critical quantity r as introduced
in section 2, and the condition $\Gamma_s(\theta_2) < c$ insures
that the tangent plane interval (11) will in
fact be approximating a likelihood interval
rather than some dissimilar region. This
condition also provides for an added measure
of agreement between these intervals since it
is equivalent to r < 1/4 rather than simply
r < 1.

Applying (22) and (24) when $\theta_1$ is the
subset of interest gives

expected.  Notice  that  this  conclusion  cannot 
be  obtained  by  inspecting  the  A  array  given  in 
(28)  and  (29).  As  mentioned  previously, 
different  subsets  in  general  require  different 
orderings  for  the  columns  of  V  and  thus 
different  coordinates.  This  is  the  case  here. 

Finally, we consider the special case
characterized by $(\theta_1, \theta_2) = (3, 0)$ and
r = .428. These conditions correspond to
$n = 2\sigma^2$. From (9), $\Gamma^{\tau} = .33 < .41 = (\chi^2(.05;2))^{-1/2}$.
From Figure 1 (Cook and Witmer,
1984), we see that the likelihood region,
whose level is exact in this case, does not
seem to be adequately approximated by the
tangent plane region for small values of $\theta_1$.

Further insight into this problem can be
gained by inspecting marginal regions for
$\theta_1$ and $\theta_2$. Generally, marginal regions for
subsets can be obtained by projecting all
points in the joint region onto the
appropriate subspaces. The projections of the
regions in Figure 1 onto the $\theta_1$ axis show that
the likelihood and tangent plane intervals for
$\theta_1$ will be identical, as expected. The
projections onto the $\theta_2$ axis show that the
resulting 98.6 percent likelihood interval
will be about 60 percent longer than the
corresponding tangent plane interval! This
dissimilarity is clearly indicated by
$\Gamma_s(\theta_2) = .67 > .41 = (\chi^2(.014;1))^{-1/2}$.
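
The two guides quoted for this special case, and the 98.6 percent level of the projected interval, can be reproduced with a few lines of arithmetic. The sketch below uses only the Python standard library and relies on two standard facts: the upper-$\alpha$ point of a chi-square variable on 2 degrees of freedom is $-2\ln\alpha$, and a chi-square variable on 1 degree of freedom is a squared standard normal. The function names are illustrative.

```python
import math
from statistics import NormalDist

def chi2_upper_2df(alpha):
    """Upper-alpha point of chi-square on 2 df (closed form)."""
    return -2.0 * math.log(alpha)

def chi2_cdf_1df(q):
    """P(chi-square on 1 df <= q), via the standard normal cdf."""
    return 2.0 * NormalDist().cdf(math.sqrt(q)) - 1.0

def chi2_upper_1df(alpha):
    """Upper-alpha point of chi-square on 1 df."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z

cutoff = chi2_upper_2df(0.05)               # joint 95 percent cutoff
joint_guide = cutoff ** -0.5                # guide for the joint region
marginal_level = chi2_cdf_1df(cutoff)       # level of the projected interval
marginal_guide = chi2_upper_1df(1.0 - marginal_level) ** -0.5
```

Projection keeps the same cutoff, so the marginal guide coincides with the joint guide while the nominal level rises from 95 to about 98.6 percent.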

Our experience leads to the following
heuristic characterization of the problem
described in the previous paragraph. Consider
a $p_2$-dimensional subset $\theta_2$ with guide
$c_2 = (F_{\alpha}(p_2, n-p))^{-1/2}$ and partition
$\theta_2^T = (\theta_{21}^T, \theta_{22}^T)$ where $\theta_{2i}$ is $p_{2i} \times 1$, i = 1, 2.
The guide corresponding to the confidence
region for $\theta_{21}$ obtained by projecting the
selected $1-\alpha$ region for $\theta_2$ is simply
$c_{21} = c_2 (p_{21}/p_2)^{1/2}$. If the subset
curvatures for $\theta_{21}$ are large relative to $c_{21}$
and the subset curvatures for $\theta_{22}$ are near
zero, it can happen that the curvatures for $\theta_2$
are moderate. In such cases the curvatures
for $\theta_2$ can provide a misleading indication
that the tangent plane and likelihood regions
for $\theta_2$ are in acceptable agreement. As hinted
above, this problem might be overcome by
requiring that all subsets $\theta_{2i}$ of $\theta_2$ have
curvatures less than the respective guides
$c_{2i}$. When $\theta_2 = \theta$ this added requirement seems
to represent a useful fine tuning of the basic
Bates-Watts methodology.

4. ILLUSTRATIONS

In  this  section  we  present  several 
numerical  examples  to  illustrate  selected 
results  of  the  previous  sections. 

For the first example we use the
Michaelis-Menten model

$$f_i = \theta_1 x_i/(\theta_2 + x_i) \tag{33}$$

in combination with the 12 observations
reported in Bates and Watts (1980). Figure 2
gives 87 percent tangent plane (broken
contour) and likelihood (solid contour)
confidence regions for $(\theta_1, \theta_2)$. Here and in
the following examples the levels of displayed
bivariate confidence regions are chosen so
that the corresponding univariate marginal
regions have a nominal 95 percent coverage
rate. It seems clear from Figure 2 that the
tangent plane region for $(\theta_1, \theta_2)$ is not an
adequate approximation of the likelihood
region, although interpreting the Bates-Watts
guide directly as the cutoff value would lead
to the opposite conclusion, since
$\Gamma^{\tau} = .598 < c = .635$. The subset curvatures
for $\theta_1$ and $\theta_2$ are listed in Table 1; the
corresponding guide is c = .449. Again, the
curvatures are less than the guide while the
marginal likelihood regions do not seem to be
well represented by the corresponding tangent
plane regions. This reinforces our previous
remark that curvatures must be substantially
less than c to insure close agreement. With
this interpretation we see that all curvatures
successfully indicate the dissimilarity
between the various likelihood and tangent
plane regions in Figure 2.

Figure 3 gives 88% likelihood and tangent
plane regions for $(\theta_1, \theta_2)$ obtained by using
model (33) and the 7 observations reported by
Michaelis and Menten (1913). For these data
$\Gamma^{\tau} = .079$. This value and the subset
curvatures reported in Table 1 are relatively
small, indicating reasonable agreement between
the regions displayed in Figure 3.

For our next example we use the
exponential model

$$f_i = \theta_1 (1 - \exp(\theta_2 x_i)) \tag{34}$$

in combination with the 6 observations
reported in Draper and Smith (1981, p. 522,
data set 3). In this case $\Gamma^{\tau} = 1.92$ clearly
indicates the dissimilarity between the 88
percent regions for $(\theta_1, \theta_2)$ shown in Figure 4.
However, the 95% marginal regions for $\theta_2$ are
in close agreement, while the agreement
between the marginal regions for $\theta_1$ seems less
than adequate. These conclusions are clearly
indicated by the subset curvatures
r  (0-)  -  .069  and  f  (0,)  .31^  which  may  be 

S  ^  3  1 

judged relative to the guide c = .360.

For the three-parameter asymptotic
regression model

$$f_i = \theta_1 - \theta_2 \exp(\theta_3 x_i) \tag{35}$$

and the 27 observations reported in Ratkowsky
(1983, p. 101, data set 1), we obtain $\Gamma^{\tau} = 1.53$.

The corresponding guide is
$c = [F_{.05}(3,24)]^{-1/2} = .58$. This suggests that
the 95 percent likelihood region for
$\theta^T = (\theta_1, \theta_2, \theta_3)$ cannot be adequately
approximated by the corresponding tangent
plane region. The subset curvatures for
selected subsets of $\theta$ are listed in Table 1.
From  these  curvatures  alone  we  would  reach  the 
following  conclusions:  1)  The  likelihood  and 
tangent  plane  regions  for  8^  are  in  very  close 
agreement.  2)  The  marginal  regions  for  0^ 
and  8^  will  be  noticeably  different,  but  the 
agreement  is  probably  adequate  for  most 
purposes.  3)  The  usual  95  percent  tangent 
plane  regions  for  (8^,8^)  and  (82, 9j)  should 
be  used  for  only  very  rough  analyses,  although 
lower  level  regions  may  be  acceptable 
replacements  for  the  corresponding  likelihood 
regions.  These  conclusions  are  supported  by 
the  86  percent  regions  for  and  (8^,8^) 

shown  in  Figures  5  and  6,  respectively. 

For our final example we again use the
asymptotic regression model (35), this time in
combination with the 9 observations reported
by Hunt (1970). Subset curvatures for all
parameter subsets are listed in Table 1. The
subset curvature for $\theta_3$ is small, indicating
good agreement between the corresponding
likelihood and tangent plane regions. The
subset curvatures for the remaining subsets,
particularly $(\theta_2, \theta_3)$, are large.

The 87 percent likelihood and tangent
plane confidence regions for $(\theta_2, \theta_3)$ are given
in Figure 7. The large total curvature,
$\Gamma_s(\theta_2, \theta_3) = 36.4$, correctly indicates that use
of the tangent plane region as an
approximation of the disjoint likelihood
region would be a disaster for this pair of
parameters. In fairness, however, it should
be recalled that the approximations used to
derive the subset curvatures are local so that
$\Gamma_s(\theta_2, \theta_3)$ is responding primarily to the
disagreement between the tangent plane region
and the portion of the likelihood region that
contains $\hat\theta$. Similar comments apply when only
$\theta_2$ is of interest.

From Figure 7, there is reasonable
agreement between the tangent plane and
likelihood regions for $\theta_3$, as indicated by the
small curvature $\Gamma_s(\theta_3) = .095$. It can be
argued justifiably, however, that this correct
indication from the curvature is largely
fortuitous since the curvatures do not
recognize the contribution of the smaller
piece of the likelihood region for $(\theta_2, \theta_3)$ to
the likelihood region for $\theta_3$. Under this
argument, the subset curvature measure for $\theta_3$
has failed to indicate the dissimilarity
between the tangent plane region for $\theta_3$ and
the likelihood region (c. 0191,0) obtained by
using only the larger subregion that contains
$\hat\theta$.

The reason that the curvatures give some
inappropriate indications in this final
example is that both the linear and quadratic
approximations to the model function fail.
This failure is evident from a very low $R^2$
from the regression used by Goldberg, Bates
and Watts (1983) to obtain numerical
curvatures, and from related measures of "lack
of quadraticity" explored by the present
authors. In cases where the quadratic
approximation to the model function is poor,
curvature measures based on that approximation
may not be meaningful.

Nevertheless, these subset curvature
measures represent an important advance in our
understanding of nonlinear models, and provide
useful information about the adequacy of the
linear approximation when the quadratic
approximation is appropriate. Further work is
needed on methods of identifying cases where
the quadratic approximation may fail.

5.  CONCLUSIONS 

The subset curvatures developed in this
paper appear to be reliable indicators of the
adequacy of tangent plane confidence regions
for most nonlinear models. In particular, the
curvature for a single parameter is a useful
tool for assessing the agreement between
standard large sample confidence intervals and
corresponding marginal likelihood regions.
This ability to deal with subsets greatly
extends the usefulness of the Bates-Watts
methodology.

Because the original Bates-Watts framework
applies only to the complete parameter vector,
guidelines developed in that framework can be
misleading when the adequacy of the linear
approximation is very different for different
subsets. To ensure good agreement between the
tangent plane and likelihood regions, the
maximum curvature must be considerably smaller
than the Bates-Watts guide. However, this
criterion can be too stringent for certain
parameter subsets if the whole-parameter
curvature $\Gamma^{\tau}$ is used. By contrast, the subset
curvature describes the shape of the
likelihood region in the parameter subspace of
interest. Thus, the subset curvature is more
directly relevant to the tangent plane
adequacy question and, based on the examples
described above, is evidently more accurate.

The practical usefulness of the methods
described here depends, in part, on their ease
of implementation. The subset curvatures for
any selected subset can be computed directly
from the Bates-Watts parameter-effects
curvature array. This array can be obtained
either analytically (Bates and Watts, 1980) or
numerically by using the procedure given in
Goldberg, Bates and Watts (1983).

The usefulness of the subset curvatures
depends also on the restriction that the
intrinsic curvature of F at $\hat\theta$ is small. This
restriction is not of great practical
importance since it has been found to hold in
most cases. Nevertheless, a unified approach
which incorporates the intrinsic curvature
component might offer further insight in some
situations.

Another area for further research is the
development of measures that indicate when the
subset curvatures themselves may be unreliable
due to the failure of the second-order
approximation to the model function. While
the possibility of such failure is of concern,
the class of models adequately described by a
quadratic function is considerably larger than
the class for which the linear approximation
alone is adequate.

APPENDIX


Derivation  of  Equation  (19) 

To  develop  equation  (19)  from  equation 
(lA),  we  first  require  explicit  expressions 
for  and 

A.l.  4^  and  4^ 


Let $\ddot{L}$ and $\dddot{L}$ denote the p×p matrix and
p×p×p array of second and third partial
derivatives of the log likelihood L with
respect to the elements of $\theta$, respectively.
Let $g_a$ denote the a-th component of g as
defined following (12) and partition $\ddot{L}$ as


where  Ljj  is  Pj  *  Pj > 

Since  g  maximizes  Lie^.B^)  for  each  fixed 
value  of  B^  we  clearly  have 


3L(g(B2).92) 


t 

i 

i 


0 


(A.l) 


for  a-1,2 . p^  and  all  e^.  This  Identity 

will  be  used  as  the  basis  for  obtaining  and 

*2’ 

Differentiating  both  sides  of  (A.1)  with 
respect  to  9^  and  evaluating  at  6^  gives 
(L^ , >  ° 

Since  the  submatrlx  consisting  of  the  last  p^ 
rows  of  Is  an  Identity  matrix  It  follows 


wl“’  l 

■'ll  12 


Let $e_i = y_i - f_i(\hat\theta)$. Then the first term of

$$\ddot{L} = \Big( \sum_i e_i W_i - V^T V \Big) / \sigma^2$$

represents intrinsic curvature of F at $\hat\theta$.
Since this curvature is assumed to be
negligible, $\ddot{L} = -V^T V/\sigma^2$ and therefore
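
The expression for the second-derivative matrix of the log likelihood can be checked by finite differences. The sketch below does this for model (33) with known $\sigma$; the design points, parameter values, and perturbations are hypothetical numbers chosen only for the check, and the function names are ours. Here `W[i]` holds the 2×2 matrix of second derivatives of $f_i$.

```python
import numpy as np

def model(theta, x):
    return theta[0] * x / (theta[1] + x)

def derivatives(theta, x):
    """First (V) and second (W) derivatives of model (33)."""
    t1, t2 = theta
    d = t2 + x
    V = np.column_stack([x / d, -t1 * x / d**2])
    W = np.zeros((x.size, 2, 2))
    W[:, 0, 1] = W[:, 1, 0] = -x / d**2
    W[:, 1, 1] = 2.0 * t1 * x / d**3
    return V, W

def loglik(theta, x, y, sigma):
    e = y - model(theta, x)
    return -0.5 * np.sum(e**2) / sigma**2

def hessian_formula(theta, x, y, sigma):
    V, W = derivatives(theta, x)
    e = y - model(theta, x)
    return (np.einsum('i,ijk->jk', e, W) - V.T @ V) / sigma**2

def hessian_numeric(theta, x, y, sigma, h=1e-4):
    """Central-difference Hessian of the log likelihood."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    H = np.zeros((p, p))
    for j in range(p):
        for k in range(p):
            dj, dk = np.eye(p)[j] * h, np.eye(p)[k] * h
            H[j, k] = (loglik(theta + dj + dk, x, y, sigma)
                       - loglik(theta + dj - dk, x, y, sigma)
                       - loglik(theta - dj + dk, x, y, sigma)
                       + loglik(theta - dj - dk, x, y, sigma)) / (4.0 * h * h)
    return H

# Hypothetical data for the check only.
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
theta = np.array([2.0, 1.5])
y = model(theta, x) + np.array([0.05, -0.03, 0.02, -0.04, 0.01])
```

When the residuals are small or orthogonal to the second-derivative vectors, the first term is negligible and the matrix reduces to $-V^T V/\sigma^2$.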


-(Vjv,)''’v>2 


‘•"it  ”12 


where  V  -  (V, .V^)  and  R^j  Is  defined  In  (15). 

An  expression  for  A^  can  be  obtained 

similarly  by  taking  second  partial  derivatives 

of  (A.l)  with  respect  to  0-  and  B-  , 

2r  23 


r,3-1 ,2, . . 


This  yields 


ab  38_  39, 
2r  2s 


where  L  L  .  and  a.  denote  the  Indicated 
ab  abc  b 

elements  of  L,  L  and 

-  (g^CBj),  e^),  respectively,  and 

a*1,2, _ ,Pj.  The  component  3^aj^/392^392g  Is 

the  (r,s)-th  element  of  the  b-*th  face  A_.  of 

2b 


A^.  Since  A^^^ 


0  for  b-Pj«l,...,p  the 


summation  on  the  left  of  (A. A)  need  only  range 


from  1  to  p, 


Notice  also  that  3a„/39_  Is 
c  2r 


simply  the  (o,r)-th  element  of  A^ .  Expressing 
(A.M)  In  matrix  notation  and  solving  for  A, 


[‘I  *,] 


Here and in what follows brackets [ ][ ]
indicate column multiplication as defined in
Bates and Watts (1980). (Generally, if A is
an a×b matrix and B is a b×c×d array then the
elements of the i-th face $C_i$, i = 1,...,a, of
the a×c×d array C = [A][B] are $A_i^T B_{jk}$,
j = 1,2,...,c, k = 1,2,...,d, where $A_i^T$ is the i-th
row of A and $B_{jk}$ is the jk-th column of B.)
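
In array terms this column multiplication is a contraction over the first index of B. A minimal sketch with NumPy follows; the function name is ours and the storage convention `B[l, j, k]` (so that `B[:, j, k]` is the jk-th column) is an assumption.

```python
import numpy as np

def bracket_multiply(A, B):
    """C = [A][B]: C[i, j, k] = sum over l of A[i, l] * B[l, j, k]."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return np.einsum('il,ljk->ijk', A, B)
```

Each element of the result is the inner product of a row of A with a column of B, as in the definition above.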

As  before  we  will  take 
T  2 

Si  -  “SS'o  • 

To  further  evaluate  A^,  we  require  the 
pxpxp  array  L.  Straightforward  algebra  will 
verify  that 


Sbc  j., 


abc.  a  j.bc  ,  ^b  ^ac  .  ^c  ^abj 


Using  this  representation  It  is  easily 

verified  that  the  a-th  face  L  of  L  Is 

a 


S  •  ^  ^2  Hb3v’')[Hl  *  v\  «  kJv)  (A. 6) 
0 


where  b  Is  the  a-th  standard  basis  vector  for 

a  T 

and  K  -  b  W  Is  the  nxp  matrix  with  W  as 
a  a  ^  ac 

the  c-th  column.  Finally,  It  follows  from 
(A. 6)  that 


z’'  L  Z  -  ^  ♦  2[zV][h1z1  (A. 7) 

0 

where  Z  Is  an  arbitrary  pxl  vector.  This  form 
will  be  useful  In  later  developments. 
A.2 Tangent Plane, Term (19a)

It follows immediately from (A.3) that

where $P_{V_1}$ is defined following (18). Thus, the
relevant tangent plane is the affine subspace
$F(\hat\theta) + C(P_{V_1} V_2)$. Transforming term (14a)
according to (16) and (17) immediately gives
term (19a).

A. 3  Parameter-Effects,  Term  (19b) 

From the form of given by (A.3), it is
clear that term (19c) is in $C(V_1)$ and is thus
orthogonal to the $\theta_2$-subspace tangent plane.
The parameter-effects component of (19) must
therefore come from term (19b).

The three-dimensional array W in (19b) can
be decomposed into the sum of three arrays
with orthogonal columns.


«  -  [Py*-’’v  *  (PyH”] 


The first term in this decomposition contains
the projections of the columns of W onto
$C(P_{V_1} V_2)$ and thus it represents parameter-effects
curvature for the subset problem. The second
and third terms are intrinsic components for h
and F, respectively. Since the intrinsic
curvature of F at $\hat\theta$ is assumed to be
negligible, the third term of (A.9) is set to
zero. Addend (19b) can now be reexpressed as


i  *3  *2 

-  2  *2  '^“'*1  *2 

’  (A. 10) 

♦  I  *1  A]^[PyJ(vf]6,l2  (A. 10b) 
Using this in combination with (17) and (A.8)
to transform the coordinates in term (A.10a)
gives term (19b).

A.4 Intrinsic Curvature, Term (19c)

In the expansion of h given in (19), we
still have the sum of terms (19c) and (A.10b)
to deal with. We first consider (19c).

Using  (A. 5)  and  (A. 7)  with  Z  -  we 

have 

§  V(♦^  Ag  *2)  -  (1^  a{  ‘l’  a,  Ig) 

-  M  [*2  Ay][w]A,  *2 

where  M  -  (V^IV^V,)"’  ,  o).  The  first  term  of 
(A. 11)  Is  exactly  the  negative  of  term  (A. 10b) 
so  that  In  an  obvious  notation 

(19o)  ♦  (A. 10b)  -  r  M[iyv’‘l[w]A,42 

(*.)2) 

.  -  [♦yV^)[MWA,)  *2 

From  (18)  and  the  definition  of  W,  It  can  be 
shown  that 
Finally, using this relationship, (A.8) and
(17) to transform the coordinates in (A.12) we
obtain term (19c).


From  (18)  and  (A. 3)  It  follows  that 
Figure 3. Nominal 88% bivariate regions with 95% marginal
regions for $(\theta_1, \theta_2)$ from model (33) and the
Michaelis-Menten (1913) data.

Figure 4. Nominal 88% bivariate regions with 95% marginal regions
for $(\theta_1, \theta_2)$ from model (34) and the Draper-Smith (1981)
data.
ABSTRACT


REX (Regression Expert) demonstrated the feasibility of building data
analysis consultation programs using expert system techniques. However,
experience with REX development showed the need for automated
assistance in building, maintaining, and extending knowledge bases for other
specific data analytic tasks. Symptoms of this need were difficulty
maintaining consistency across examples, need for the statistician to learn an
obscure language, and difficulty of specialization.

Programming  by  examples  is  a  natural  approach  in  the  statistics  domain, 
because  working  examples  is  necessary  in  any  case.  Such  an  approach 
would  address  the  problems  noted  in  the  development  of  REX.  Three 
fundamental  steps  in  the  development  of  a  practical  programmed-by-example 
system  are  the  acquisition  of  the  first  example,  acquisition  of  an  additional 
consistent  example,  and  the  integration  of  an  inconsistent  example. 

By  restricting  the  domain  within  which  knowledge  can  be  acquired  to  data 
analysis,  it  has  been  possible  to  design  practical  solutions  to  these  three 
steps.  The  first  phase  of  Student,  a  system  designed  to  learn  data  analysis 
strategies  from  examples,  has  been  implemented.  It  acquires  the  first 
example  in  any  data  analysis  area,  and  incorporates  many  features  required 
for  handling  problems  of  additional  consistent  and  inconsistent  knowledge. 


1.  Background 

REX  is  a  consultation  program  in  an  area  of 
statistics,  regression  analysis,  built  using  expert 
system  techniques.  Its  performance  was 
described  in  [Pregibon  and  Gale,  1984].  It  had 
an  active  life  as  a  demonstration  system, 
running  about  weekly  for  a  year.  It 
demonstrated  the  feasibility  of  using  expert 
system  techniques  to  build  a  consultant  in  data 
analysis.  However,  as  detailed  in  the  next 
section,  the  knowledge  acquisition  process  for 
REX  left  a  lot  to  be  desired. 

Regression  analysis  is  one  technique  of  a 
broader  category  of  data  analysis  techniques. 
Other  techniques  include  spectrum  analysis, 
analysis  of  variance,  and  cluster  analysis,  for 
example.  A  statistician  doing  data  analysis 


operates  on  a  data  set  or  example.  A  general 
goal  of  the  analysis  is  to  meaningfully 
summarize  the  salient  features  of  the  data  set. 
The  standard  form  of  summary  is  a  statistical 
model,  typically  with  parameters  estimated  from 
the  data  set.  By  using  plots  and  numerical  tests, 
the  statistician  detects  incompatibilities  between 
the  model  and  the  data  set,  which  are 
ameliorated  by  some  action,  such  as 
transforming  a  variable,  changing  the  model,  or 
changing  the  method  of  estimating  parameters. 

In mimicking this process, REX checks for
problems using tests, and recommends actions to
the client after verifying that a proposed action
will solve the problem found. It offers to show
the client plots whenever it detects a problem or
recommends an action.
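
The check, verify, recommend cycle described above can be sketched in a few lines. This is not REX's implementation, whose inference engine is not described here; the class `Problem`, the function `consult`, and the example test and action are all hypothetical names used to illustrate the control flow of testing for a problem and recommending an action only after confirming that it removes the problem.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Problem:
    name: str
    test: Callable[[dict], bool]            # True when the problem is present
    actions: List[Callable[[dict], dict]]   # candidate remedies, in order

def consult(data: dict, problems: List[Problem]) -> List[str]:
    """Return recommendations; an action is recommended only if verified."""
    log = []
    for problem in problems:
        if not problem.test(data):
            continue
        for action in problem.actions:
            candidate = action(data)
            if not problem.test(candidate):  # verify before recommending
                log.append(f"{problem.name}: recommend {action.__name__}")
                data = candidate
                break
        else:
            log.append(f"{problem.name}: no verified action")
    return log

# Hypothetical test and remedy for illustration.
def wide_range(d: dict) -> bool:
    return max(d["y"]) - min(d["y"]) > 100

def log_transform(d: dict) -> dict:
    return {**d, "y": [math.log10(v) for v in d["y"]]}
```

For example, `consult({"y": [1, 10, 100, 1000]}, [Problem("wide range", wide_range, [log_transform])])` recommends the transformation because the test no longer fires on the transformed data.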
In building REX, the statistical knowledge it
contains has come to be called a "strategy" for
regression analysis. The term seems appropriate
as the nature of the knowledge includes

what  problems  to  look  for, 

when  to  look  for  them, 

how  to  look  for  them, 

how  to  decide  if  they  are  real,  and 

what  to  do  if  they  are. 

There  is  very  little  statistical  literature  relevant 
to  strategy,  and  indeed,  REX,  as  an 
environment  for  developing  and  testing  strategy 
has  opened  up  a  new  area  of  research. 

2. A Critique of Knowledge Acquisition in REX

Developing  a  strategy  for  use  in  REX  was  a 
labor-intensive  process.  Two  phases  can  be 
distinguished.  In  the  first  phase  the  statistician 
responsible  for  the  strategy,  Daryl  Pregibon, 
chose  a  half  dozen  regression  examples  that 
clearly  showed  some  common  problems.  He 
then  analyzed  them  using  interactive  statistical 
software  with  an  automatic  trace.  After 
analyzing the group of examples, he studied the
traces and abstracted a description of what he
was doing. We coded this as a strategy for
REX  and  tried  it  on  a  few  more  examples.  He 
revised  the  strategy  completely  at  this  point,  and 
the  second  phase  began. 

In the second and longer phase, one of us
selected one additional regression example and
ran REX interactively on the chosen example.
Typically  the  strategy  would  not  handle  the 
example  (since  the  example  was  selected 
knowing  what  would  stretch  REX),  and  we 
modified  the  strategy  so  that  the  example  would 
be  handled.  This  process  was  iterated  through 
about  three  dozen  more  examples. 

Based  on  this  experience,  and  on  a  feeling  that 
it  was  typical  of  other  techniques,  we  do  not 
believe  it  is  possible  to  construct  a  data  analysis 
strategy  without  working  through  many 
examples.  The  range  of  the  decisions  needed  to 
construct  a  strategy  is  extreme,  and  there  is  no 
literature  simplifying  the  task.  Therefore  the 
only  available  defense  of  a  strategy  is  to 
demonstrate  performance,  which  requires 
working  many  examples  more  than  those  used 
to  construct  the  system.  Our  experience  also 
leads  us  to  believe  that  it  is  easy  to  generalize 


from  data  analysis  examples  -  relatively  few 
examples  are  needed  to  exhibit  the  required 
distinctions. 

However, the way in which we worked
examples for REX was far from ideal. The first
difficulty  with  our  method  was  assuring 
ourselves  that  a  strategy  modified  to  work  one 
additional  example  still  worked  all  previous 
examples.  We  could  by  brute  force  run  REX  in 
batch  mode  on  all  previous  examples  and  see  if 
the  performance  was  the  same.  Usually  we 
reasoned  that  most  of  the  previous  examples 
could  not  be  affected,  and  checked  the  few  that 
might  be  affected  by  hand.  Naturally,  the  more 
examples  worked,  the  more  severe  this  problem 
became.  The  necessity  to  check  consistency  in 
batch  mode  for  a  system  designed  to  be 
interactive  reduced  the  flexibility  of  the  strategy 
developed. 
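
The brute-force consistency check described above amounts to rerunning the strategy on every stored example and comparing the result with what was recorded earlier. A minimal sketch follows; the function name, the representation of a strategy as a callable, and the dictionary of recorded outcomes are assumptions made for illustration, not a description of how REX stored its examples.

```python
from typing import Callable, Dict, Hashable, List

def changed_examples(strategy: Callable[[object], Hashable],
                     examples: Dict[str, object],
                     recorded: Dict[str, Hashable]) -> List[str]:
    """Names of stored examples whose outcome differs under `strategy`."""
    return [name for name, data in examples.items()
            if strategy(data) != recorded[name]]
```

Running this after every modification replaces the hand reasoning about which earlier examples might be affected, at the cost of one batch run per change.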

Secondly,  the  method  used  was  the  epitome  of 
the  currently  standard  two-person  development 
of  expert  systems.  I  wrote  the  inference  engine 
used  while  Daryl  was  responsible  for  the 
strategy  developed.  Whenever  Daryl  wanted  to 
do  something  he  hadn't  done  before,  we  had  to 
huddle,  as  Daryl  was  learning  a  language  he 
would  only  use  to  construct  one  program.  In  a 
department with twenty professional statisticians
and one person intimately familiar with the
inference engine, it was not clear how many
additional data analysis techniques could be
handled by this two-person approach.

Thirdly,  it  would  be  difficult  to  modify  the 
strategy  in  REX.  Modifiability  is  important  first 
because  a  growing  literature  on  strategy  can  be 
expected  to  suggest  desirable  changes.  It  is 
important  secondly  because  strategies  need  to  be 
specialized  to  the  needs  of  a  particular  group. 
Statistics  is  a  discipline  that  is  applied  in  other, 
"ground",  domains.  Current  knowledge 
representation  and  language  generation 
techniques  are  not  adequate  to  producing  a  tool 
that  will  speak  physics  with  physicists  and 
psychology  with  psychologists.  An  alternative 
to  one  broad  tool  is  a  tool  that  is  readily 
specialized.  However,  the  first  two  problems 
would  make  this  difficult:  to  specialize  the 
program  a  local  statistician  would  have  to  learn 
a  language  used  by  no  other  program  in  the 
world,  and  the  modifications  made  might 
inadvertently  destroy  some  capabilities  of  the 
strategy. 
One valuable insight gained from building REX
is  an  abstract  view  of  its  strategy  that  we  believe 
can  be  transferred  to  other  data  analysis 
techniques.  A  practical  data  analysis  consists  of 
an  attempt  to  use  a  simple  technique  that  is  well 
understood  (by  statisticians!).  However,  its  use 
is subject to a number of assumptions which
may or may not hold in a particular data set. When
an assumption is violated, either the data must
be modified to fit the simple technique, or a
more advanced technique must be used. In
other words, it has been possible to view data
analysis as a diagnosis problem (although not all
statisticians agree!). This view is
"meta-knowledge" about data analysis which has been
built into Student, as described below.

3.  Requirements  for  Learning  By  Example 

The  necessity  of  working  examples  to  construct 
a  data  analysis  strategy  suggests  examining  the 
possibility  of  acquiring  strategies  directly 
through  some  process  of  working  examples. 
The  previous  discussion  suggests  that  the 
process  would  need  to  assist  the  user  in 
establishing  consistency  across  all  examples 
worked,  and  should  not  require  the  statistician 
to  learn  an  obscure  language. 

I  am  suggesting  that  progress  in  knowledge 
acquisition  is  possible  through  restriction  of  the 
domain  of  knowledge  to  acquire.  An  issue  for 
this  approach  is  whether  the  restricted  domain 
is  broad  enough  to  be  worth  the  difficulty  of 
constructing a special tool. For data analysis, I
believe the answer is yes. A human statistician
is typically expert in one or a few types of data
analysis, while a dozen data analysis techniques
would cover the bulk of data sets analyzed
[Snee, 1980]. One might ultimately distinguish a
few  dozen  data  analysis  techniques.  Therefore, 
many  statisticians  will  be  needed  to  construct  a 
reasonably  comprehensive  data  analysis  expert 
system. 

A  program  by  example  system  is  enticing  for 
other  reasons.  First,  it  would  be  useful  for  the 
study  of  statistical  strategies  to  collect  multiple 
strategies  for  the  same  type  of  data  analysis. 
Combination  of  knowledge  from  multiple 
experts  is  an  open  problem  in  expert  system 
construction.  I  view  collection  of  a  body  of 
strategies  from  multiple  experts  as  a  necessary 
precursor  to  serious  study  of  this  problem. 
Second, a statistician at a specific location could
specialize the system by working examples
typical of local practice. The value of
specialization was discussed in the previous
section.

A  few  systems  previously  developed  come  to 
mind  in  considering  construction  of  an  expert 
system  by  working  examples.  Teiresias  [Davis, 
1979]  is  the  chief  example  of  a  program 
designed  for  interactive  transfer  of  expertise  to 
an  expert  system.  The  mode  of  using  Teiresias 
was to be that of selecting an example,
letting  the  system  run  until  it  made  a  mistake, 
eliciting  the  key  piece  of  knowledge  to  avoid  the 
mistake,  and  adding  the  new  knowledge.  The 
system  therefore  operates  by  acquiring  an 
additional  piece  of  knowledge  presumed 
consistent  with  that  previously  acquired.  In 
addition  to  adding  consistent  knowledge, 
however,  there  are  two  other  major  problems 
that  need  to  be  solved  for  a  practical  learning  by 
example  system. 

First,  the  system  must  support  the  acquisition  of 
a  first  example  or  rule.  In  a  production  system, 
the first rules acquired are typically different
from  later  rules,  because  the  system  uses  a  core 
of  rules  to  encode  control  information.  A 
subject  matter  expert  will  not  be  able  to  provide 
control  information. 

Second, the system must support deliberate
change to the knowledge base over time. We
need to directly determine the consistency of
new examples with previous examples, not just
assume it. We do not want to take a
"debugging" attitude, but one of showing what
is right the first time.

On  the  other  hand,  there  are  some  systems  that 
support  programming  by  example,  although 
none  of  them  are  for  construction  of  expert 
systems. Tinker [Lieberman, 1983], PHD
[Attardi and Simi, 1983], SBA [Zloof and De
Jong, 1977], and a system by Bierman and
Krishnaswamy [1976] are examples. Attardi
and  Simi  review  several  of  these  systems,  which 
are  designed  for  office  automation 
programming.  Tinker  appears  to  be  the  closest 
to  our  ideas  for  Student. 

In  using  Tinker,  the  programmer  selects  a 
concrete  typical  example  of  data  for  the 
procedure.  He  then  performs  the  procedure 
step  by  step.  The  system  is  therefore  able  to 
learn how to do the first example. As more
examples are supplied, the program required for
them is compared with the already constructed
program.  If  the  two  differ,  the  user  is  queried 
for  a  predicate  that  will  distinguish  the  two 
cases.  Therefore,  the  user  ultimately  provides 
one  example  for  each  branch  of  the  final 
program. 

Tinker  seems  to  assume  that  the  user  knows 
how  each  example  should  be  worked;  there  is  no 
means  to  change  the  program  by  deleting  an 
example already worked. The way a particular
data analysis should be done is not cut and
dried, and indeed, the statistician is typically
learning  about  a  particular  example  while  doing 
the  analysis.  I  have  built  into  Student  some 
means  of  modeling  what  the  statistician  has 
learned,  or  may  have  learned,  to  capitalize  on 
this  opportunity  for  knowledge  acquisition.  I  do 
not  yet  know  how  effective  this  will  be. 

On the other hand, Tinker is tackling a harder
problem  in  that  it  hopes  to  support  Lisp 
programming  of  any  procedure.  Lieberman 
demonstrates  its  level  of  success  in  this  by 
creating  a  simple  editor.  It  is  an  encouraging 
demonstration.  Tinker’s  use  of  menus, 
pointing,  and  question  answering  are  suggestive 
techniques. 

4.  Preliminary  Experience  with  Student 

Student  is  a  system  designed  to  allow  a 
statistician  working  alone  to  build  an  expert 
consultation  system  in  a  data  analysis  area.  A 
first  phase  has  been  implemented.  The  first 
phase  is  designed  to  acquire  the  first  example  in 
any  data  analysis  area. 

Student  can  be  operated  in  two  modes  -- 
consultation  mode  and  learning  mode.  In 
consultation  mode,  it  will  work  functionally  in  a 
manner  similar  to  REX,  suggesting  acceptable 
ways  to  analyze  a  given  data  set.  Since  it  is 
general  to  the  extent  of  data  analysis,  it  would 
handle  a  much  wider  range  of  problems  than 
REX  does,  given  the  requisite  strategic 
information. 

Student  is  able  to  acquire  the  first  example 
because  it  is  limited  to  data  analysis,  and  is  not 
a  general  purpose  tool  for  learning  arbitrary 
things by example. In particular, the meta-knowledge
about what a practical data analysis
is, inferred from building REX, is built into
Student. This meta-knowledge is represented as
a network of eleven types of frames, as shown
in the following table.

input variables
data types
assumption testing
plot
generic plot
test
generic test
action
question discriminator
predicate discriminator
words

Each  type  of  frame  has  its  own  set  of  slots, 
which  represent  the  things  that  must  be  known 
in  order  to  carry  out  a  consultation.  When  a 
slot  has  not  been  filled,  the  system  knows  that  it 
doesn't  know  that  information.  It  can  then  do 
something  to  acquire  the  information,  which  is 
usually just to ask the statistician.
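
The slot-filling behaviour described here can be sketched in a few lines. This is only an illustration in Python, not Student's implementation: the class name `Frame`, the slot names, and the `ask` callback standing in for a question to the statistician are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Frame:
    """A frame whose slots start out unfilled (None)."""
    kind: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def unfilled(self) -> List[str]:
        """Names of the slots the system knows it does not know."""
        return [name for name, value in self.slots.items() if value is None]

    def fill_missing(self, ask: Callable[[str, str], str]) -> None:
        """Acquire each missing value, here simply by asking."""
        for name in self.unfilled():
            self.slots[name] = ask(self.kind, name)


# Hypothetical slots for a "test" frame; two of them are still unknown.
test_frame = Frame("test", {"name": "normality", "statistic": None, "threshold": None})

# Canned answers stand in for an interactive dialogue with the statistician.
canned = {"statistic": "Shapiro-Wilk W", "threshold": "0.05"}
test_frame.fill_missing(lambda kind, slot: canned[slot])
```

Because an empty slot is represented explicitly, the same check that drives a consultation also tells the system what it still has to learn.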

Student  manages  two  major  data  structures. 
One,  the  strategy,  has  just  been  discussed.  The 
other  is  a  second  network  of  frames  that 
represents  a  trace  of  the  analysis  of  the  current 
example. It is built of three types of frames:
entry points, decisions, and actions. The trace
can branch at each decision point, if the user
gives more than one response (at different
times) to a question posed by Student. A
decision frame records all the responses to a
given question, and bookkeeping information to
uniquely  express  the  set  of  answers  effective  at 
a  given  point  in  the  trace.  The  action  frames 
represent  each  side  effect  action  taken  by  the 
program.  The  entry  points  are  created  each 
time  an  assumption  testing  frame  is  begun  in  the 
strategy.  They  allow  the  user  to  return  to  the 
same  exact  context  in  which  the  frame  was 
begun at any time. The user can then reach any
decision  previously  made  by  stepping  through 
decisions  to  be  left  standing. 

These  two  data  structures  support  phase  1  and 
have  been  designed  with  an  eye  towards  work 
on  phase  2  (acquiring  an  additional  consistent 
example)  and  phase  3  (acquiring  an  inconsistent 
example). The remaining paragraphs in this
section discuss how consistency and
inconsistency are expected to be handled.

The  analyses  demonstrated  by  the  statistician 
are  assumed  to  be  acceptable  analyses  of  the 
examples  (as  judged  by  a  statistician).  A  major 
focus  of  design  in  Student  has  been  to  assure 
that  as  a  data  analysis  strategy  evolves,  all 
previous  analyses  remain  acceptable  analyses  (as 
judged  by  Student’s  strategy).  This  is  the  basic 
test  of  consistency.  Points  at  which  consistency 
is  not  obvious  have  been  found  to  fall  in  four 
categories:  provably  consistent,  mechanically 
consistent,  mechanically  checkable,  and 
provably  inconsistent.  A  provably  consistent 
change  results  when  pre-specifiable  data  is 
sufficient  to  prove  consistency.  A  mechanically 
consistent  change  results  when  information 
needs  to  be  gathered  by  reexamining  previous 
examples,  but  the  result  must  be  a  consistent 
strategy.  A  mechanically  checkable  change 
requires  reexamination  of  prior  examples  in 
order  to  show  consistency,  and  the  review  may 
establish  inconsistency.  A  provably  inconsistent 
change  results  when  pre-specifiable  data  is 
sufficient  to  prove  inconsistency. 
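
The basic test of consistency, and the "mechanically checkable" case in particular, can be illustrated by replaying prior examples against a changed strategy. The sketch below is an assumption-laden simplification, not Student's mechanism: a strategy is modelled as a plain function from hypothetical data-set facts to a recommended analysis, and each prior example records the analyses the statistician judged acceptable.

```python
from typing import Callable, Dict, List, Set, Tuple

Facts = Dict[str, bool]             # hypothetical description of a data set
Strategy = Callable[[Facts], str]   # facts -> recommended analysis
Example = Tuple[Facts, Set[str]]    # facts plus the analyses judged acceptable


def conflicting_examples(strategy: Strategy, examples: List[Example]) -> List[int]:
    """Indices of prior examples that the strategy no longer analyzes acceptably."""
    return [i for i, (facts, accepted) in enumerate(examples)
            if strategy(facts) not in accepted]


examples: List[Example] = [
    ({"outliers": False, "curvature": False}, {"least squares"}),
    ({"outliers": True, "curvature": False}, {"robust regression"}),
]


def extended(facts: Facts) -> str:
    """A changed strategy that adds a branch without disturbing old ones."""
    if facts["outliers"]:
        return "robust regression"
    if facts["curvature"]:
        return "transform, then least squares"
    return "least squares"


def careless(facts: Facts) -> str:
    """A changed strategy that ignores outliers."""
    return "least squares"


print(conflicting_examples(extended, examples))   # no prior example conflicts
print(conflicting_examples(careless, examples))   # the second example conflicts
```

An empty result corresponds to a change shown consistent by reexamination; a non-empty result isolates the prior examples that would need attention.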

Treatment  of  inconsistent  changes  rests  on  how 
the  trace  of  the  latest  example  is  related  to  the 
accumulated  strategy.  Each  example  worked 
produces  a  trace  with  all  the  information 
gathered  from  the  statistician.  Each  trace 
represents  an  example  worked  in  the  context  of 
the  strategy  accumulated  to  that  point,  and  the 
strategy  changes  called  for  by  the  trace  are 
guaranteed  to  be  consistent  with  that 
accumulated  strategy.  Therefore,  an  ordered  set 
of  traces  is  a  kind  of  "source  code"  from  which 
it  is  possible  to  "compile"  an  integrated  strategy 
consistent  with  all  the  examples  represented  in 
the  traces. 

A  provably  inconsistent  change  will  conflict  with 
parts of the traces of some prior examples.
Those  parts  will  have  to  be  reworked  manually, 
and  it  is  a  service  to  isolate  them  for  attention. 
The  remaining  parts  of  the  traces  can  be 
retained,  assuming  that  the  actions  based  on  the 
(incorrectly  derived)  data,  although  incorrect  for 
the  example,  were  correct  for  the  data.  The 
result will be a tree of partial strategies, each
branch representing an inconsistent difference
between two strategies. Each node represents a
strategy which can be derived by integrating the
ordered set of traces from the root to the node.
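
The idea of deriving a strategy by integrating the traces from the root to a node can be sketched as follows. This is a hedged illustration only: representing a trace as a dictionary of strategy changes, and the names `TraceNode` and `compile_strategy`, are assumptions made for the example rather than Student's data structures.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TraceNode:
    """One worked example: the strategy changes its trace calls for."""
    trace: Dict[str, str]
    parent: Optional["TraceNode"] = None


def path_from_root(node: Optional[TraceNode]) -> List[TraceNode]:
    """Ordered list of nodes from the root down to the given node."""
    path: List[TraceNode] = []
    while node is not None:
        path.append(node)
        node = node.parent
    return list(reversed(path))


def compile_strategy(node: TraceNode) -> Dict[str, str]:
    """Integrate the ordered traces into one strategy."""
    strategy: Dict[str, str] = {}
    for step in path_from_root(node):
        strategy.update(step.trace)
    return strategy


root = TraceNode({"outliers present": "use robust fit"})
# Two branches that disagree about the same situation.
branch_a = TraceNode({"curvature present": "transform response"}, parent=root)
branch_b = TraceNode({"curvature present": "add quadratic term"}, parent=root)

print(compile_strategy(branch_a))
print(compile_strategy(branch_b))
```

Both branches share everything learned at the root and differ only in the inconsistent part, which is what the tree of partial strategies is meant to capture.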

5.  Summary 

REX  is  a  working  demonstration  of  the 
feasibility  of  expert  systems  for  data  analysis.  It 
has  several  strengths:  a  convenient  user 
interface,  ability  to  solve  standard  textbook 
regression  problems,  and  a  modest  ability  to 
explain  the  reasons  for  its  suggestions. 
However,  it  also  has  limitations,  mainly  in 
supporting  strategy  acquisition,  modification, 
and  specialization. 

Student  has  been  designed  to  build  upon  REX's 
strengths  while  overcoming  its  limitations. 
Student  will  allow  statisticians  to  construct  or 
extend  knowledge  based  consultation  systems  by 
working  examples  and  answering  questions. 
This  will  provide  easier  and  faster  construction 
of better consultation systems in data analysis.

The  proposition  that  Student  explores  is  that  by 
restricting  the  domain  within  which  knowledge 
can  be  acquired,  significant  assistance  in 
knowledge  acquisition  is  possible.  The  control 
information  needed  to  structure  the  first 
example can be provided. The information
necessary  to  prove  whether  a  change  of 
knowledge  is  consistent  can  be  specified  and 
collected.  Support  for  changing  inconsistently 
with  some  previous  examples  can  be  provided. 

Data analysis management is a methodology intended to increase the
productivity of the data analyst. A primary entity for data analysis
management is the "save-state", a collection of metadata and data that
captures a state of the analysis. The analyst may create a save-state
to designate a milestone of the analysis. The save-state may be used
later to return to that milestone by restoring the conditions of the
analysis that existed at the time the save-state was created. Scientists
at Pacific Northwest Laboratory (PNL) have developed a prototype data
analysis  management  system.  In  that  system,  a  save-state  includes 
pointers  to  the  data  sets  and  command  procedures  active  at  the  time  the 
save-state  was  created,  active  plot  descriptions  and  other  graphics 
parameters,  and  comments  supplied  by  the  analyst.  Associated  with  each 
save-state  is  a  record  of  the  sequence  of  commands  or  operations  used  to 
accomplish  the  transition  from  the  previous  (parent)  save-state. 

Metadata  also  describes  the  overall  relationships  between  the  save- 
states  that  have  been  created  during  the  analysis. 


NATURE  OF  THE  DATA  ANALYSIS  PROCESS 

For  the  past  several  years,  a  team  of 
computer  scientists  and  statisticians 
working  on  the  Analysis  of  Large  Data 
Sets  Project  (ALDS)  at  the  Pacific 
Northwest  Laboratory  (PNL)  has  been 
investigating  the  nature  of  the  data 
analysis process [1,2,3,4]. There were
several  motivations  for  this  work.  We 
wanted  to  understand  the  analysis 
process  better.  We  wanted  to  provide 
better  tools.  We  hoped  we  could  learn 
more about how an expert data analyst
worked  in  order  to  help  less  experienced 
analysts. 

When  we  examined  the  data  analysis 
process,  we  were  able  to  identify 
several  properties  of  it.  The  process 
tends  to  be  iterative  with  similar 
operations  applied  repeatedly  for 
different  data  sets  and  subsets.  The 
process tends to be exploratory. The
analyst  has  some  basic  ideas  of  how  to 
approach  the  analysis  at  the  onset,  but 
the  direction  the  analysis  takes  often 
results  from  the  knowledge  gained  from 
previous  points  in  the  analysis. 

Because  of  this,  data  analysis  is  best 
pursued  interactively.  It  is  very 
difficult  to  write  the  complete  script 
for  the  analysis  before  it  begins. 

The  analysis  process  can  result  in  many 
dead  ends.  Because  of  its  exploratory 
nature,  the  analyst  may  try  several 
approaches that simply don't work. This
fact is often not apparent when the
final  results  of  the  analysis  are 
presented,  since  only  the  successful 
results  are  presented.  However,  it  is 
useful  to  keep  track  of  these  dead  ends. 
They  can  be  useful  in  showing  that  the 
analysis  was  rigorous  and  complete  and 
that  reasonable  alternatives  were 
investigated. 

The  analyst  may  have  several 
alternatives  to  explore  at  various 
points  in  the  analysis.  Since  only  one 
alternative  can  be  dealt  with  at  a  time, 
it  is  useful  to  be  able  to  return  to 
previous points in the analysis in order
to  try  another  alternative. 

The  process  is  characterized  by  fits  and 
starts,  dead  ends,  and  decision  points 
with  many  options  to  explore.  Although 
we  can  think  of  the  process  as 
proceeding  linearly  through  time  from 
beginning  to  end,  the  process  really  has 
more  structure  to  it  than  that.  Rather 
than  representing  the  process  as  a 
straight  line,  the  process  is  better 
characterized  as  a  tree  where  the  nodes 
of  the  tree  represent  significant  points 
in  the  analysis,  which  we  call  "save- 
states,”  and  the  lines  between  the  nodes 
represent  the  steps  in  the  analysis  that 
took  place  to  create  the  child  node  from 
the  parent  node.  Figure  1  shows  how 
such  a  tree  can  be  depicted  graphically. 
From  any  point  in  the  analysis 
designated  as  a  save-state,  the  analyst 
can proceed until a significant point in
the analysis is reached. This point can
be  defined  as  a  new  node  of  the  tree. 

The  analysis  can  proceed  on  from  that 
point  or  the  analyst  can  return  to  a 
previous  node  in  the  analysis  and  begin 
another path. By allowing the analyst
to go back to previous nodes and
proceed from those points, a tree can
be created. This graphical
representation  also  depicts  where  the 
analyst  is  currently  working.  The  star 
at  the  end  of  a  line  segment  indicates 
that  the  analyst  is  currently  proceeding 
down  the  path  indicated  by  the  line 
segment  and  the  analyst  may  create  a  new 
node  at  any  time.  The  new  node  will 
replace  the  star  in  the  graph. 
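
A minimal sketch of this tree, with a marker for where the analyst is currently working, might look like the following. It is an illustration under stated assumptions, not the PNL system: the class `AnalysisTree`, its method names, and the node titles are all hypothetical.

```python
from typing import List, Optional


class AnalysisTree:
    """Nodes are significant points in the analysis; `current` marks the
    node the analyst is working from (the star hangs off this node)."""

    def __init__(self, root_title: str) -> None:
        self.titles: List[str] = [root_title]
        self.parent: List[Optional[int]] = [None]
        self.current = 0

    def create_node(self, title: str) -> int:
        """Turn the path being explored into a new node under the current one."""
        self.titles.append(title)
        self.parent.append(self.current)
        self.current = len(self.titles) - 1
        return self.current

    def return_to(self, node: int) -> None:
        """Go back to a previous node in order to begin another path."""
        if not 0 <= node < len(self.titles):
            raise IndexError("no such node")
        self.current = node

    def children(self, node: int) -> List[int]:
        return [i for i, p in enumerate(self.parent) if p == node]


tree = AnalysisTree("raw data")
cleaned = tree.create_node("outliers removed")
tree.create_node("linear fit")
tree.return_to(cleaned)          # try another alternative from the same point
tree.create_node("robust fit")
print(tree.children(cleaned))    # the two alternatives explored from "outliers removed"
```

Returning to an earlier node and creating a new one is what turns a straight line of work into a tree.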


COMMON  TOOLS  FOR  DATA  ANALYSIS 

There  are  a  number  of  tools  available  as 
aids  in  performing  computer-based  data 
analysis.  Although  these  tools  have 
improved  steadily  over  the  past  several 
years,  there  is  very  little  to  help  the 
analyst  manage  the  process.  The 
sections  below  discuss  the  desirable 
characteristics  of  data  analysis  tools 
and,  with  the  exception  of  the  section 
on  statistical  functions,  the  areas  in 
which  they  lack  capabilities  for  helping 
the  analyst  manage  data  analysis. 

Statistical  Functions.  Most  statistical 
analysis  packages  are  built  around  a 
library  of  statistical  functions  that 
can  be  applied  to  data  sets.  However, 
no  package  can  anticipate  (or  afford  to 
develop  and  maintain)  all  functions  the 
analyst may want to apply. Some systems
such as AT&T Bell Laboratories' S System
[5] have been designed to be extensible
so that the analyst can add new
functions as they are identified and can
run  functions  available  in  languages 
such  as  FORTRAN  within  the  environment 
provided  by  S. 

At  PNL,  we  have  always  attempted  to 
build  systems  which  can  utilize  existing 
data  analysis  packages.  The  work  we 
have  done  in  data  analysis  management 
was  built  using  S  as  a  base. 

Data Management. Most systems for data
analysis  have  facilities  for  data 
management  that  allow  the  analyst  to 
organize,  store,  and  retrieve  data.  All 
provide  capabilities  for  data  to  be 
brought  into  the  system  for  analysis  and 
for  data  and  results  to  be  displayed  and 
printed  as  output.  Some  systems  have 
better  facilities  for  organizing  data 
than  others.  Some  support  more 
complicated  data  structures  than  others. 
Some  allow  the  analyst  to  provide 
meaningful  names  rather  than  simply 
assigning  column  numbers  to  data 
variables  and  leaving  it  to  the  analyst 
to  keep  track  of  which  column  contains 
which  variable. 

It  is  often  useful  to  store  data  that  is 
derived  during  the  course  of  the 
analysis.  Some  of  this  derived  data  may 
consist  of  intermediate  results  that  can 
be  useful  in  later  phases  of  the 
analysis.  The  process  of  storing  and 
recalling  derived  data  should  be  easy  to 
perform. 

Both  raw  data  and  derived  data  need 
metadata  to  describe  characteristics  of 
the  data  itself,  such  as  the  data's 
source,  its  units,  how  it  was  calculated 
(if  derived),  what  missing  value  code(s) 
are  used,  why  it  was  generated,  and  what 
its  role  is  in  the  analysis.  It  is 
important  to  be  able  to  associate  the 
metadata  with  the  data  and  make  it 
easily  accessible  in  a  meaningful  way  to 
the  analyst. 
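
One way to keep such metadata attached to a data set is a small record stored alongside it. The sketch below is only an assumption about how this could be arranged; the field names, the `catalog` dictionary, the `describe` helper, and the example values are all hypothetical and not taken from any particular package.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataSetMetadata:
    source: str
    units: str
    role: str                         # what the data is for in the analysis
    purpose: str                      # why it was generated
    derivation: Optional[str] = None  # how it was calculated, if derived
    missing_codes: List[float] = field(default_factory=list)


# Metadata is looked up by the same name as the data set it describes.
catalog: Dict[str, DataSetMetadata] = {}


def describe(name: str) -> str:
    """One-line, human-readable summary of a data set's metadata."""
    m = catalog[name]
    origin = f"derived by {m.derivation}" if m.derivation else "raw data"
    return f"{name}: {m.units}, {origin}, from {m.source}; role: {m.role}"


catalog["log_flow"] = DataSetMetadata(
    source="stream gauge records",
    units="log(m^3/s)",
    role="response variable",
    purpose="stabilize variance",
    derivation="log of flow",
    missing_codes=[-999.0],
)
print(describe("log_flow"))
```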

It  is  important  to  provide  the  analyst 
with  a  way  of  keeping  track  of  the  data 
sets.  Some  of  this  can  be  provided 
through  the  metadata  and  through  good 
naming  conventions,  but  current  packages 
provide  no  facilities  for  associating 
data  sets  with  particular  stages  of  the 
analysis.  The  analyst  has  no  automatic 
way  of  knowing  when  the  data  set  was 
analyzed  or  where  it  is  used  in  later 
stages  of  the  analysis. 

It  is  useful  if  the  analyst  can  record 
the  context  in  which  a  data  set  was 
created.  Only  the  analyst  can  provide 
this context by describing such things
as why the operations that created the
data set were performed, how the
operations were useful, the relevance of
the operations, why the data is being
preserved, and what insights were
gained. It is not only useful to
associate  this  context  with  the  data 
itself  but  also  with  the  portion  of  the 
analysis  process  in  which  the  data  set 
was  created  or  used. 

We  see  a  need  to  provide  the  analyst 
with  tools  for  recording  this  type  of 
information. The most common mode is
for the analyst to type in the
information through a keyboard,
perhaps using an available text editor.
Another  way  to  capture  this  information 
is  by  using  an  audio  tape  recorder.  The 
analyst  can  dictate  insights  and 
comments  and  store  them  so  they  can  be 
played  back  later.  Our  tape  recorder  is 
computer-controlled  so  the  recorder  can 
automatically  advance  to  the  segment  of 
tape  containing  the  comments  relevant  to 
a  particular  save-state.  The  system 
should  be  designed  so  the  analyst  can 
use  the  mode  of  annotation  with  which 
he/she  is  most  comfortable. 

Graphics.  Graphics  is  recognized  as  an 
essential  tool  for  data  analysis.  It  is 
currently  used  during  all  phases  of  data 
analysis  including  data  checking  and 
validation,  data  exploration,  and  data 
confirmation  and  presentation.  However, 
it  is  often  difficult  to  regenerate  a 
given  graph.  In  order  to  do  so,  the 
data  sets  must  be  available  exactly  as 
they  were  before  and  the  conditions 
under  which  the  graphics  were  generated 
must  be  the  same.  Sometimes  it  is 
difficult  to  even  recall  when  during  the 
course  of  the  analysis  the  graph  was 
produced. 

Logs.  Many  statistical  analysis 
packages  will  record  the  course  of  the 
analysis  in  a  log  (also  called  a  diary 
or journal). The analyst can turn the
log  on  and  off  as  desired.  When  the  log 
is  turned  on,  all  the  commands  entered 
by  the  user  at  a  terminal  are  also 
written  to  a  file.  The  log  can  provide 
a  history  of  the  course  of  the  analysis, 
including all useful commands, non-useful
commands, and mistakes. The
analyst  can  also  insert  comments  into 
the  log  as  additional  documentation. 

Some  systems  permit  the  analyst  to  have 
results  (output)  added  to  the  log. 

Even with comments inserted by the
analyst, logs can often be
unintelligible without detailed, time-consuming
study. While they record the
actions in the order in which they
transpired, the data analysis process is
not  strictly  linear  in  time.  As 
mentioned  earlier,  the  process  can  be 
depicted  as  a  tree.  One  of  the 
advantages  of  such  a  graphical  depiction 
of  the  course  of  the  analysis  is  that 
segments  of  the  logs  can  be  associated 
with  particular  nodes  in  the  tree.  The 
log  segment  that  is  associated  with  each 
node  is  the  set  of  commands  that  caused 
the  node  to  be  created  from  its  parent. 
This  technique  gives  structure  to  the 
log. 
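
To illustrate how per-node log segments give the log structure, the sketch below rebuilds the commands that lead to any node by concatenating the segments along the path from the root. The node names, the command strings, and the function `commands_to_reach` are hypothetical and chosen only for the example.

```python
from typing import Dict, List, Optional

# Each node knows its parent and the log segment that created it from that parent.
parent: Dict[str, Optional[str]] = {
    "start": None,
    "cleaned": "start",
    "fit A": "cleaned",
    "fit B": "cleaned",
}
segment: Dict[str, List[str]] = {
    "start": ["read data"],
    "cleaned": ["plot data", "drop outliers"],
    "fit A": ["fit linear model"],
    "fit B": ["fit robust model"],
}


def commands_to_reach(node: str) -> List[str]:
    """Concatenate log segments from the root down to the given node."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    commands: List[str] = []
    for name in reversed(path):
        commands.extend(segment[name])
    return commands


print(commands_to_reach("fit B"))
```

Commands belonging to the sibling branch ("fit A") never appear in the history of "fit B", which a flat chronological log cannot guarantee.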

Procedural Capabilities. As mentioned
before, the data analysis process is
iterative. The same operations are
often  applied  to  several  data  sets  or 
subsets.  Analysts  routinely  create 
macros  (or  procedures)  consisting  of 
sets  of  commands  that  are  saved  and 
stored  as  entities.  These  procedures 
are  often  parameterized  so  they  can 
operate  as  needed  against  various  data 
sets.  Analysts  often  build  macros  from 
the  log.  The  log  is  edited  to  remove 
errors  and  superfluous  commands  and  then 
tested.  It  is  refined  and  debugged. 

When  the  analyst  is  satisfied,  the  macro 
can  be  stored  for  later  use. 

In  S,  macros  are  stored  in  structures 
similar  to  those  used  for  data.  The 
analyst  can  differentiate  between  macros 
and  data  because  the  names  of  macro  data 
structures  are  prefixed  with  "mac." 

Just  as  data  sets  should  be  associated 
with  portions  of  the  analysis,  it  is 
useful  to  associate  macros  with  portions 
of  the  analysis  in  order  to  identify 
where  they  were  created  and  where  they 
were  applied. 


THE  SAVE-STATE 

We  have  developed  a  new  methodology  to 
aid  the  analyst  in  managing  the  data 
analysis  process.  The  primary  entity 
for  managing  data  analysis  is  the  "save- 
state,"  a  collection  of  metadata  and 
data  that  captures  significant 
information  about  the  state  of  the 
analysis  at  a  certain  point  in  the 
analysis  process.  The  analyst  may 
create  a  save-state  at  any  time  during 
the analysis. Save-states may be created
for  any  number  of  reasons: 

-  The  analyst  may  wish  to  designate  a 
milestone  in  the  analysis  because  a 
significant  insight  was  gained  at 
that  point  in  the  analysis. 

-  A  decision  point  was  reached  in  the 
analysis  and  several  different 
alternatives  can  be  explored  from 
this  point  in  the  analysis. 


-  A dead end was reached that is
worthy of being preserved for
documentation purposes.

-  A more significant alternative
needs to be explored but the
portion of work is incomplete and
the analyst must return to it
later.


Once a save-state exists, the analyst
can "restore" that save-state in order
to resume analysis from the point at
which the save-state was created. The
effect is as if the analyst had moved
back in time to the point at which the
save-state was created.

Information associated with the save-state
includes the name of the save-state,
the date and time the save-state
was created and last accessed, the name
of the analyst who created the save-state,
the states of various icons
(described below) that are part of the
save-state, a list of the data sets and
macros associated with the save-state, a
list of plots associated with the save-state,
written comments entered at the
keyboard, and information that points to
verbal comments saved on cassette tape.
This information is sufficient to give
the analyst a quick overview of what the
save-state contains and why it was
created. The analyst can "scan" the
save-state to view this type of
information without having to restore
the save-state and incur the
overhead of moving data sets around.
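
The items listed above suggest a record along the following lines. This is a hedged sketch, not the layout used in the prototype: the class `SaveState`, its field names, and the example values are assumptions. The point it illustrates is that a scan needs only this small record, never the data sets themselves.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class SaveState:
    name: str
    analyst: str
    created: datetime
    last_accessed: datetime
    icons: Dict[str, bool] = field(default_factory=dict)
    data_sets: List[str] = field(default_factory=list)   # names only, not the data
    macros: List[str] = field(default_factory=list)
    plots: List[str] = field(default_factory=list)
    comments: str = ""
    tape_position: Optional[int] = None   # pointer to dictated comments, if any

    def scan(self) -> str:
        """Summarize the save-state from its metadata alone; no data set is read."""
        active = sorted(icon for icon, on in self.icons.items() if on)
        return (f"{self.name} by {self.analyst}, created {self.created:%Y-%m-%d %H:%M}; "
                f"{len(self.data_sets)} data sets, {len(self.macros)} macros, "
                f"{len(self.plots)} plots; icons: {', '.join(active) or 'none'}")


# Hypothetical example values.
state = SaveState(
    name="after cleaning",
    analyst="analyst A",
    created=datetime(1985, 3, 17, 10, 30),
    last_accessed=datetime(1985, 3, 18, 9, 0),
    icons={"insight": True, "audio": False},
    data_sets=["flow", "log_flow"],
    macros=["mac.clean"],
    plots=["scatter"],
)
print(state.scan())
```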

Besides storing the save-states
themselves, information is stored that
allows the relationships between the
various save-states to be graphically
depicted. In order to do this, the
system stores an internal name for the
save-state; its title; a set of indices
that depict the parent, child, and
sibling relations between the save-states;
a flag that indicates whether
the save-state has actually been deleted
and only a place marker is preserved;
and a flag that indicates whether the
save-state is the currently active
state, the last scanned state, or is an
incomplete state waiting to be created.

Also associated with each save-state is
the  segment  of  the  log  that  contains  the 
set  of  commands  that  describe  the 
transition  between  the  state  and  its 
parent  state. 


In  order  to  provide  easy  access  to  the 
tree  of  data  analysis  save-states  and 
take  advantage  of  its  natural  structure, 
our  prototype  data  analysis  management 
system, ADAM, is graphics based. The
prototype has been implemented on a
Digital  Equipment  Corporation  VAX 
11/780.  The  high  resolution  graphics 
display  device  is  a  Ramtek  9400.  The 
audio  cassette  deck  used  for  recording 
and  playing  dictation  is  a  Yamaha  K-700 
cassette  deck.  As  mentioned  earlier,  we 
are using AT&T Bell Laboratories' S
statistical  analysis  system.  S  is 
running  under  Eunice,  a  UNIX  derivative 
that  allows  VMS  to  be  run  as  the  base 
operating  system  while  still  providing 
UNIX  functionality. 

The  tree  of  save-states  is  always 
present  on  the  high  resolution  color 
graphics  device  whenever  the  analyst  is 
performing  data  analysis  management 
functions  (see  Figure  1).  This  is  the 
same  device  that  is  used  to  display 
graphics  during  data  analysis.  The 
analyst  interacts  with  ADAM  through  a 
series of menus. Priority windows [6]
are used. Both the menus and the
windows are based on the principle of
successive  disclosure.  The  analyst 
controls  the  level  of  detail  displayed 
at  any  time.  The  analyst  can  select 
more  or  less  detailed  menus  or  graphical 
displays  of  the  save-states  and  log 
segments  as  desired. 

We  have  defined  three  classes  of 
functions  that  can  be  performed  using 
the  menus.  There  are  (1)  functions  that 
are  performed  on  save-states,  (2) 
functions  that  are  performed  on  segments 
of  the  log,  and  (3)  utility  functions. 
The  utility  functions  include  RETURN, 
which  allows  the  analyst  to  move  to  a 
higher  level  menu;  HELP,  which  provides 
help  on  the  menu  currently  displayed; 
MOVE WINDOW, which allows the various
windows on the screen to be moved from
place to place; and S-MODE, which allows
the  analyst  to  exit  the  data  analysis 
management  mode  and  return  to  the  S 
statistical  analysis  package  to  perform 
further  analysis. 

The  functions  that  can  be  performed  on 
save-states  include  SCAN,  RESTORE, 
MODIFY,  SHOW  NETWORK,  ERASE  NETWORK, 
CREATE,  and  DELETE.  Each  of  these 
functions  is  discussed  in  more  detail 
below. 

The  SCAN  function  provides  the  analyst 
with  an  overview  of  the  save-state  being 
scanned. Information from the save-state
is displayed in a window that
overlays the graph of the save-states.
The SCAN display is shown in Figure 2.

FIGURE 2. Scanning a Save-State

FIGURE 3. Creating a New Save-State

On the color graphics device, the save-state
being scanned is shown in a color
different from the color used for the
other save-states. In the figure, the
save-state being scanned is shown with a
thickened line. The icons indicate
special characteristics of the save-state.
The light bulb icon indicates
that special insight was gained at this
point in the analysis. The ear icon
indicates that the analyst has dictated
some ideas on cassette tape which can be
played on a computer-controlled audio
cassette deck. The eye icon indicates
that some graphics are associated with
this save-state and can be recreated if
desired. The keyboard key icon
indicates that the analyst has keyed in
some documentation which can be
displayed in a window if desired. The
SCAN function can be performed with very
little overhead. No data sets are
accessed or moved except the small data
structure that contains information on
the save-state.

The  RESTORE  function  allows  the  analyst 
to  move  back  to  a  previous  point  in  the 
analysis  at  which  the  save-state  was 
created.  Whenever  the  analyst  restores 
a  save-state  and  returns  to  the 
statistical  system  to  do  more  analysis, 
the  evolution  of  a  new  save-state  has 
begun. The RESTORE function requires
that data sets currently in the working
data base be replaced by the data sets
belonging to the restored save-state.

The MODIFY function allows the analyst
to directly modify the save-state. The
analyst  can  modify  the  author  or  title 
of  the  save-state,  turn  the  icons  on  and 
off,  modify  documentation  associated 
with  the  save-state,  and  modify  the  list 
of  data  sets  and  macros  associated  with 
the  save-state  (although  this  does  not 
change  their  contents). 

Although the data analysis process is
most often depicted as a tree, we
recognize that the process is not
strictly a tree. It is really better
characterized as a network. The process
becomes a network whenever the analyst
includes a data set in a save-state that
the save-state did not inherit from its
parent (e.g., a data set is imported
from another save-state not in the
current analysis path). However, we
recognized that continually depicting
the network would make the display so
confusing that it would be very
difficult to get a good overview of what
was going on. We created the SHOW
NETWORK and ERASE NETWORK functions to
allow the analyst to see the underlying
network structure when desired and to
remove it in order to restore the
uncluttered tree representation. When
the network is displayed, arrows are
drawn from the appropriate non-ancestral
save-states to the save-state currently
being scanned or restored.
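
The tree-versus-network distinction can be made concrete with a small
sketch. The paper does not describe ADAM's internal data structures, so
the class, field, and function names below are hypothetical; the sketch
only assumes that each save-state knows its parent and, for each data
set it holds, the save-state in which that data set originated.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SaveState:
    """Hypothetical save-state record; not ADAM's actual layout."""
    title: str
    parent: Optional["SaveState"] = None
    # Maps data set name -> title of the save-state where it originated.
    datasets: dict = field(default_factory=dict)

    def ancestors(self):
        """Titles of all save-states on the path back to the root."""
        node, found = self.parent, []
        while node is not None:
            found.append(node.title)
            node = node.parent
        return found


def network_arrows(state):
    """Titles of non-ancestral save-states that supplied a data set."""
    lineage = set(state.ancestors()) | {state.title}
    origins = set(state.datasets.values())
    return sorted(origin for origin in origins if origin not in lineage)


root = SaveState("raw data", datasets={"survey": "raw data"})
cleaned = SaveState("cleaned", parent=root,
                    datasets={"survey": "raw data", "clean": "cleaned"})
side = SaveState("side study", parent=root,
                 datasets={"weights": "side study"})
model = SaveState("model fit", parent=cleaned,
                  datasets={"clean": "cleaned", "weights": "side study"})

print(network_arrows(model))    # sources of arrows drawn by SHOW NETWORK
print(network_arrows(cleaned))  # empty: this save-state is a pure tree node
```

In this sketch the tree display needs only the parent links, while the
arrows are derived on demand, which mirrors the idea of showing the
network only when asked.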

The  CREATE  function  can  be  invoked  as 
desired  whenever  the  analyst  feels  that 
a  significant  point  in  the  analysis  has 
been  reached.  The  options  of  the  CREATE 
function  are  shown  in  Figure  3.  When 
the analyst creates a new save-state,
the analyst is prompted for a title.
The analyst's name was provided as the
author's name at the beginning of the
ADAM session. Both the title and the
author can be modified if desired. The
analyst can turn icons on and off, add
verbal and/or written comments, and
include or exclude datasets and macros
during the creation process. When the
creation process is complete, the
analyst can either choose the option to
store the newly created save-state and
return to the higher-level menu or quit
and return to the higher-level menu
without creating the save-state. The
analyst can move back and forth between
the statistical analysis system and ADAM
without creating save-states.

When  a  new  save-state  is  created,  the 
star  (asterisk)  on  the  tree  that  marks 
the  current  point  in  the  data  analysis 
process  is  replaced  with  a  box 
representing  the  save-state.  The  newly 
created  save-state  becomes  the  current 
save-state  and  any  further  processing 
will  proceed  from  that  point  in  the 
process.  If  the  analyst  restores 
another  save-state,  processing  will 
proceed from that point instead.

The  DELETE  function  can  be  used  to  mark 
save-states  as  deleted.  The  analyst 
cannot  restore,  scan,  or  modify  a 
deleted  save-state.  The  deleted  save- 
state  appears  on  the  display  as  a  circle 
without  a  title  in  it. 

The  log  functions  are  SCAN  LOG,  SCAN 
PLOTS,  EDIT  LOG,  and  CREATE  MACRO.  The 
analyst  can  perform  any  of  these 
functions  on  any  log  segment.  When  the 
analyst  chooses  SCAN  LOG  or  SCAN  PLOTS, 
a window is opened on the graphics
device and the information is displayed
as typified in Figure 4. There may be
more information available than will fit
in the window. The analyst can scroll
between portions of information. If the
analyst  wants  to  edit  the  log  or  create 
macros,  ADAM  will  invoke  a  standard  text 
editor  so  the  analyst  can  edit  the  log 
segment  of  interest. 


FUTURE  DIRECTIONS 

ADAM was designed by a group of computer
scientists and statisticians with the
needs and desires of the statisticians
in mind. Our next step is to test ADAM
under the conditions of a real analysis.
There are a number of questions we are
seeking to answer. We want to know how
well the concept actually works in
practice. Our experience with ADAM will
form the basis for the next generation
data analysis management system.

We are concerned about how we can clean
up the log in order to make it more
intelligible and still maintain in it
what is necessary and sufficient to
replicate graphics and restore the
save-state. Our current DELETE command
only marks a save-state as deleted. We
need to determine how to truly delete
save-states and what the implications of
these deletes are with respect to other
save-states which share the same data
sets. We already know that a delete of
a state with no children is different
from a delete of a state that has
several children. We want to
investigate whether comments recorded on
cassette tape are really useful and how
their usefulness compares to comments
that are typed into a file by the
analyst.

The  environment  provided  by  machines 
designed  for  artificial  intelligence 
work shows great promise for both the
programmer  and  the  analyst.  We  are 
investigating  whether  these  machines  can 
provide  a  better  environment  in  which  to 
do  both  data  analysis  and  data  analysis 
management. 


*  This  work  was  supported  by  the  Applied 
Mathematical  Sciences  Group,  Scientific 
Computing  Staff,  U.  S.  Department  of 
Energy,  under  contract  DE-AC06-76RLO 
183B. 
1 INTRODUCTION

The ability of statisticians to perform calculations,
both numerical and non-numerical, has changed
radically in the last few decades and the pace of
change continues to increase. In providing graduate
students with appropriately modern training,
statistics departments must respond by
modernizing both computing environments and
curricula. These are intertwined, the course serving
needs created by the environment, and the
environment determining some choices among
topics in the course. This paper will describe the
current environment at Carnegie-Mellon and the
content of a course that we believe should be
taken by all Ph.D. students in statistics. We make
further introductory remarks, then present the
resource description in Section 2 and the course
description in Section 3.


1.1  The  Past  at  Carnegie-Mellon 

Thirty years ago statisticians did their computation
on desk calculators. As recently as 10 years ago,
the CMU Statistics Department relied on the
campus computing center's IBM 360/67. Course
work was primarily theoretical, using pencil and
paper exercises and no computing. At about that
time the university made a strong commitment to
the wide-spread use of interactive computing for
educational purposes. By 1980 CMU was acquiring
about one DEC 2060 and 100 terminals per year for
the central computing facility, and had acquired
software such as BMDP, SPSS, MINITAB, IMSL,
DISSPLA, and TELLAGRAF. These facilities are
used for coursework for both undergraduate and
graduate students. The system can support about
500 simultaneous users.

In 1981, the department began acquiring its own
computer terminals; in 1982, we purchased our first
VAX. By the time this appears in print, the CMU
Statistics Department will have its own local area
network with at least six personal computers and
ten workstations (including some color). Our VAX
has an attached array processor and we provide
our own I/O facilities (including a pen plotter and a
graphics laser printer).

We are part of a very large local area network
with more than 250 nodes, which is scheduled to
become an order of magnitude larger in the next
18 months. In less than five years we have gone
from total dependence on a large central computer
facility to our own independent operation based on
a substantial number of interconnected machines.
Our situation has changed dramatically and will
continue to change; it is our job to adapt our
graduate programs to the new environment.


1.2 Intelligent Consumption

Computer hardware and computer software have
become an integral part of our daily activities. We
find it necessary to devote substantial effort to
keeping abreast of developments in both arenas so
that our environment continues to improve. We
think it is wise to transfer some of what we learn
to our students, as they, too, will soon make such
decisions wherever they might be.

At the same time, because we do not yet have
essentially unlimited computational resources, we
have to be constantly aware of the limitations of
our environment, in terms of both numerical
accuracy and computational efficiency. Again,
we think it is good to transfer this awareness to
our students. Our motive in this case is partly
selfish; graduate students can have a negative
impact on our shared environment if they do not
appreciate the various tradeoffs amongst the
resources available.


1.3 Curriculum

Computing is an integral part of the curriculum at
all levels of study in statistics at CMU. Virtually
all courses other than probability theory and the
theory of inference make moderate to heavy use
of our computing facilities. We summarize
computational activity within our program according
to level of study.

1. Undergraduate: Introductory courses for
routine elementary data analysis; special
topics courses such as (i) Statistical
Software Packages and (ii) Elements of
Statistical Computation.

2. Masters Degree: Data analysis in the
various statistical methodology courses;
special topics courses.

3. Doctoral Degree: Advanced Data
Analysis coursework; Statistical
Computing coursework; advanced topics
and seminars.

4. Specialist in Computation: Software
design; theoretical work on algorithms;
numerical analysis.


2  RESOURCES 

We list some of the hardware and software
resources available, and then discuss the approach
taken at CMU.


2.1 Hardware


2.1.1  Microcomputers 

The IBM PC (and its variants) and Apple Macintosh
are fairly powerful machines, particularly when
compared with what was available in a central
facility a decade or two ago. Random access
memory used to be a scarce resource; now,
personal computers may have half a megabyte of
storage or more. Some statistics departments rely
heavily on them, and many students will eventually
be doing much of their work on these machines.

A substantial increase in the value of personal
computers occurs when they are linked together in
a local area network. We say more on this point
below; we will also briefly discuss software for
microcomputers.


2.1.2 Workstations

A workstation is a high resolution graphics
terminal connected to a dedicated host computer.
A workstation offers an improved environment for
most computing tasks, including data analysis and
software development. Multiple windows allow one
to perform a variety of tasks nearly
simultaneously. For example, a data analyst can
look at a dataset plotted in several different
projections at the same time, or can look at plots
of several datasets side by side, or can compose
text in one window while displaying plots in
another.

Like personal computers, workstations are more
valuable when connected in a local area network.
The disadvantage of a timesharing system is that
with many potential users, the system is often
overloaded. Adding a network of workstations to
the system allows the individual users access to a
system that is essentially independent of the
number of simultaneous users. Our experience with
workstations is very positive and we find the
communication capabilities imparted by a network
to be essential.


2.1.3 Printers, Plotters, and Terminals

Printers and plotters are necessary, and local
production of good quality text and figures is
convenient. Laser printers are very nice even for
non-production documents, but for routine
hard-copy output a line printer and an inexpensive
plotter will suffice. Graphics terminals, however,
vary substantially in providing the capabilities that
are essential for some research. Since it is likely
that prices will continue to come down, and the
use of graphical methods of statistical analysis
will continue to increase, it is a distinct advantage
for students to become familiar with locally
programmable graphics terminals (whether slow,
like the DEC GIGI, or fast and powerful, like the
Tektronix). Driven by host computers that are
available for general computing as well, these
devices can be less expensive alternatives to
stand-alone workstations.


2.1.4 Parallel Processing

We have recently added an array processor
(attached processor) to our hardware stable. It has
roughly the power of an IBM 3083, but only costs
about $25,000. We don't yet have enough
experience with it to make useful statements about
its role in training students, but we feel that there
is much potential gain from parallel computation
for statistics.


2.1.5 Networks

Networks come in various flavors. We have
access to several national networks such as Bitnet
and Telenet, as well as an extensive Local Area
Network (LAN); the best guess is that we have
about 250 machines on our LAN but some of them
are located in Cleveland, New York City,
Poughkeepsie, and elsewhere so the term local is
somewhat abused.

In 1982 CMU and IBM signed an agreement to
develop a prototype personal computing network.
The goal is to provide all students, faculty and
professional staff with access to personal
computing workstations integrated into a network
which will provide access to data-bases such as
the library card catalog, communications via mail
and bulletin boards, and software. With the
development of effective tools for non-numeric
data processing (e.g., text processing, graphics, etc.),
even departments in the liberal and fine arts are
rapidly expanding their use, and incorporating
computing into their curriculum.

The CMU distributed network will have the
following features:

1. Independent access: Access to a personal
computer workstation and its performance is
not affected by the number of simultaneous
users on the network.

2. Flexible access: Users can enter the system
from any suitably equipped site, for example
a suitably equipped workstation at home.

3. Multiple windows: Users will be able to
maintain several contexts simultaneously,
moving easily from one task to another.

4. Communications: Users will be able to
communicate with each other through
the network. There are mail facilities,
file transfer capabilities, and central
database access.

5. Multi-media capabilities: The system
will be able to generate, transmit, and
store video information, including both
static and dynamic images. There are
plans for audio capabilities as well.

6. Expandability: Currently the system has
about 50 workstations. Within [illegible] years
the system will expand to thousands of
workstations.

7. Cost-effectiveness: The prices of
personal computers are declining
relative to computing power much more
rapidly than the prices of large scale
time-sharing computers.

The planned environment has four system
elements:

1. Personal computer workstations: 32-bit
processors capable of executing 1
million instructions per second, having 1
megabyte of memory with a 1 million
pixel display of bit-map graphics (a "3M"
machine). The machine will have no disk
drives (to keep the price in the range
$3000-$6000).

2. File servers with local disk storage and
other special facilities such as laser
printers, optical scanners, etc.

3. A communications network linking the
workstations to file servers and the
central facilities.

4. Central computing facilities for large
scale online storage, large scale
computation, and other specialized
services.

Since the cost of personal computing facilities is
decreasing more rapidly than that of large systems,
this approach appears to be the least expensive
way to provide access to computing facilities for
the campus community.


2.1.6 Computing in Our Department

We have 15 FTE faculty, about 30 graduate
students, and 6 administrative/secretarial staff. (The
staff are an integral part of our environment and,
in fact, are the only ones with guaranteed access
to our VAX.) Our main processor is a VAX 11/750
with 4MB of memory, 900MB of disk storage, a
magnetic tape drive, 24 terminal lines, 3 distinct
network interfaces and a floating-point accelerator.
Our terminals are connected to it through a large
central switching facility which provides terminals
the opportunity to connect to any of a number of
other computers on the campus (and, equally,
provides other terminals the opportunity to connect
to our computer). In addition, we have 5 IBM
PC/XT personal computers, 2 Apple Macintosh
personal computers, 2 SUN 2/120 workstations, and
(by the time this appears in print) VAXstation II
workstations (each with 3MB of memory, a 30MB
disk, and the power of a VAX 11/780) and a
Tektronix 4125 color workstation.


2.2 Software

There are several categories of software that are
relevant.

1. Operating Systems: UNIX is clearly
becoming the most widely-used
operating system for mini- and
microcomputers. Students should get
some experience with it. On the other
hand, detailed knowledge of operating
systems is rarely of great use to
statisticians. (One exception is when one
has to handle large arrays with virtual
memory operating systems.)

2. Statistical Packages: It is, of course,
essential for students to get experience
with the most common statistical
packages, such as BMDP and SAS, and
it is also helpful for them to use the
newer, extensible programs designed for
interactive data analysis, such as S and
ISP.

3. Graphics: Life with a graphics terminal
is easiest when there is a good library
of graphics subroutines, including, if
possible, some for performing
transformation and rotation locally. It
does not seem especially desirable for
most students to program in a low-level
language.

4. Subroutine Libraries: Among the most
important tools for the research worker
and practitioner is the subroutine library.
Gaining an ability to understand
computational aspects of a problem at a
depth sufficient for writing good
programs that make use of high quality
subroutines, such as those in IMSL and
LINPACK, should be a central goal of
computing education for graduate
students in statistics.

5. Symbolic Computing: Statistical
problems are being solved with the aid
of symbol manipulators, such as
MACSYMA. (See Gong, 1983.) Like
faculty, students will benefit by having
a manipulation package available.

6. Data Base Management: Although data
base management systems are not often
appreciated as contributing to statistical
aspects of solutions to problems, their
great utility makes experience with them
valuable for students who will
subsequently work with large data sets.

7. Microcomputer Software: We have
examined several statistical packages
with mostly discouraging results. A
detailed review of one reasonably good
package is given by Schervish (1985).
After leaving CMU, some of our
students will work primarily on
microcomputers, and it is worthwhile to
give them the opportunity to learn about
software for micros while they are here.

8. Text Processing: Faculty and students
alike make use of SCRIBE for document
production ranging from course handouts
to articles and dissertations. In
conjunction with computer
communications facilities, this promises
to alter the way many of us conduct
our research. For example, this article
was a collaboration of four authors who
communicated primarily by computer
mail, including passing drafts and
revisions back and forth.


3 CURRICULUM FOR PH.D. STUDENTS

Computing has become a basic tool for both the
theory and the practice of statistics, much as
measure theoretic probability is a basic tool for
mathematical statistics, and should have a similar
place in the curriculum. Students, even those who
are not planning to specialize in statistical
computing, need to be aware of the theory and
practice of computing.

We outline here a one semester course in
statistical computing. One of us (Eddy) has taught a
similar course several times, and a related course
was taught by two of us together (Eddy and Kass).
At Carnegie-Mellon, this course is presently
integrated into a two-semester course in Data
Analysis. The integration, however, is quite rough:
for the most part we deal first with computing
and then with data analysis. There are some nice
opportunities for taking advantage of the
complementary nature of these two areas of
statistics, but many of the topics are basic
elements of computing and so must be taught first
on their own.

Clearly, some topics must be left out of a
one-semester course. Our choices reflect not only
judgments about the relative importance of various
topics, but also the existence of related courses in
the department. Some topics that are sometimes
mentioned as important in a course on statistical
computing fit better into other parts of our
curriculum. For example, the addition of routines
to an extensible software package is a topic that
can be covered in the statistical software packages
course. This is an undergraduate course in our
curriculum, though several graduate students usually
attend it. For alternative suggestions see Bates
(1983) and Kennedy (1983).


3.1 Fundamental Topics in Computer Science


3.1.1 Computer Organization and Hardware

We feel that it is important to have an
appreciation for the organizational structure of a
computer and suspect that this will become
somewhat more important as various kinds of
concurrent computation become more
commonplace. We therefore discuss the most
basic elements of architecture, describing the
central processing unit, memory, and input/output
devices. It can be worthwhile to discuss busses
and microprogramming. We usually talk about the
architecture of a particular machine in some detail,
and it makes sense to discuss the machine that
students will use most heavily. Currently, that
machine is the DEC VAX 11/750; in our next
iteration we may also discuss the VAXstation II.
Various kinds of parallel architectures could be
included here but we prefer to postpone that
material until we actually discuss concurrent
processing in detail.


3.1.2 Data Representation

Knowledge of internal data representation is a
prerequisite to understanding several other topics,
such as arithmetic, error analysis, random number
generation, and hashing. Obviously this knowledge
is also critical to program debugging. We feel it is
essential that students understand the
representations used not only on our machine but
also on a variety of other machines. We cover
fixed point numbers, floating point numbers
(including the IEEE P754 Floating Point Standard),
character data (BCD, EBCDIC, ASCII), and bit
strings.
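
As an illustration of what "internal representation" means in practice,
the following sketch takes a floating point number apart into its sign,
exponent, and fraction fields. It assumes the host's Python floats use
the 64-bit IEEE format, which is an assumption about the platform, not
something stated in the paper; the function names are ours.

```python
import struct


def double_fields(x):
    """Split a float (assumed IEEE 64-bit) into sign, exponent, fraction."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)     # 52 explicitly stored fraction bits
    return sign, exponent, fraction


def ascii_codes(text):
    """Character data: the ASCII code of each character."""
    return list(text.encode("ascii"))


for value in (1.0, -2.0, 0.1):
    sign, exponent, fraction = double_fields(value)
    print(value, sign, exponent - 1023, hex(fraction))
print(ascii_codes("VAX"))
```

The nonzero fraction field printed for 0.1 shows directly that this
decimal value has no exact binary representation, which is the starting
point for the discussion of rounding error.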

3.1.3 Computer Arithmetic

Basic to understanding of numerical analysis is
understanding of computer arithmetic and rounding
error. Students should be aware of the basic
operations which are available, how they are
performed, and the types of errors that can occur,
such as overflow, underflow, and rounding. For
example, it is well known to the computing
community (though often not to students) that
computation of a sum of squares by the so-called
"desk-calculator" algorithm

$$S = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2$$

is numerically unstable. Students should learn of
the better methods, and why they are superior.
(See Chan, et al., 1983.) We also introduce the
techniques of error analysis, including backwards
error analysis and stochastic error analysis.
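
The instability of the desk-calculator formula is easy to demonstrate.
The sketch below compares it with two alternatives: a two-pass
computation about the mean, and a one-pass updating recurrence (often
attributed to Welford). The function names and the test data are our
own choices for illustration; explicit loops are used so that the
arithmetic is plain left-to-right floating point addition.

```python
def desk_calculator_ss(data):
    """Sum of squares minus (sum)^2 / n, accumulated in one pass."""
    n = len(data)
    total = 0.0
    total_sq = 0.0
    for x in data:
        total += x
        total_sq += x * x
    return total_sq - total * total / n


def two_pass_ss(data):
    """First pass finds the mean, second pass sums squared deviations."""
    n = len(data)
    total = 0.0
    for x in data:
        total += x
    mean = total / n
    ss = 0.0
    for x in data:
        ss += (x - mean) * (x - mean)
    return ss


def updating_ss(data):
    """One-pass recurrence that updates the mean and the sum of squares."""
    mean = 0.0
    ss = 0.0
    for k, x in enumerate(data, start=1):
        delta = x - mean
        mean += delta / k
        ss += delta * (x - mean)
    return ss


# Large mean, small spread: the exact corrected sum of squares is 90.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]
print(desk_calculator_ss(data), two_pass_ss(data), updating_ss(data))
```

With these data the two squared totals in the desk-calculator formula
are each of order 10^18, so their difference is formed from numbers
whose rounding errors are far larger than the answer being sought.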

Students need to understand that different
machines use different data representations, and
different technologies for rounding. They should
appreciate the effect of these differences on the
accuracy of computer arithmetic. They should also
know of the IEEE standard for floating point
computations, and understand its advantages, and
they should be cognizant of programming methods
that achieve the effects of extended precision.
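
The paper does not say which programming methods it has in mind. One
well-known example, offered here only as an assumed illustration, is
compensated (Kahan) summation, which carries the low-order bits lost in
each addition in a second variable and so behaves much as if the sum
had been accumulated in higher precision.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated summation: carry holds the rounding error so far."""
    total = 0.0
    carry = 0.0
    for v in values:
        y = v - carry            # apply the correction from the last step
        t = total + y            # low-order bits of y may be lost here
        carry = (t - total) - y  # recover what was lost (with sign flipped)
        total = t
    return total


# One large term followed by many terms too small to register singly.
values = [1.0] + [1e-16] * 10000
print(naive_sum(values))
print(kahan_sum(values))
print(math.fsum(values))  # correctly rounded reference from the standard library
```

In the naive loop every small term is rounded away, so the total never
moves from 1.0; the compensated loop recovers their combined
contribution.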


3.1.4 Data Structures

Students who have programmed in Fortran or
Pascal will know what an array is, but typically
they have no experience, or even awareness, of
other data structures, the use of pointers, and
related algorithms. We introduce students to a
variety of useful data structures including linear
lists and linked lists, arrays, graphs, trees, and
hash tables. At the same time we cover a variety
of related algorithms, such as insertion and
deletion of data items from these structures,
balancing trees, garbage collection, etc.
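
A minimal sketch of one of these structures, a singly linked list with
insertion and deletion, is given below. The class and method names are
illustrative only; Python object references play the role that pointers
play in Pascal.

```python
class Node:
    """One list cell: a value and a reference to the next cell."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class LinkedList:
    def __init__(self):
        self.head = None

    def insert_sorted(self, value):
        """Insert value so that the list stays in ascending order."""
        if self.head is None or value <= self.head.value:
            self.head = Node(value, self.head)
            return
        node = self.head
        while node.next is not None and node.next.value < value:
            node = node.next
        node.next = Node(value, node.next)   # splice in after node

    def delete(self, value):
        """Unlink the first cell holding value; return True if found."""
        if self.head is None:
            return False
        if self.head.value == value:
            self.head = self.head.next
            return True
        node = self.head
        while node.next is not None:
            if node.next.value == value:
                node.next = node.next.next   # bypass the deleted cell
                return True
            node = node.next
        return False

    def to_list(self):
        out, node = [], self.head
        while node is not None:
            out.append(node.value)
            node = node.next
        return out


lst = LinkedList()
for v in (3, 1, 2):
    lst.insert_sorted(v)
print(lst.to_list())
```

Both operations only rewrite one or two references, in contrast to an
array, where inserting in the middle requires moving every later
element.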


3.1.5 Basic Algorithms

In addition to algorithms relating to data
structures there are basic algorithms and
theoretical issues that students should be aware of.
Our list includes iteration (most students already
know this), recursion (the divide and conquer
strategy, FFT, linear-time medians), sorting,
searching, and NP-completeness (e.g., the Traveling
Salesman problem).
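
To illustrate the divide and conquer idea applied to medians, here is a
sketch of selection by random partitioning. It runs in linear time on
average; the guaranteed linear-time algorithm uses a more careful pivot
rule that is not shown. The function names are ours.

```python
import random


def select(values, k):
    """Return the k-th smallest element (k = 0 gives the minimum)."""
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    while True:
        pivot = random.choice(items)
        lower = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        if k < len(lower):
            items = lower                      # answer lies below the pivot
        elif k < len(lower) + len(equal):
            return pivot                       # the pivot is the answer
        else:
            k -= len(lower) + len(equal)       # discard pivot and below
            items = [v for v in items if v > pivot]


def median(values):
    """Median by selection; the lower median for even-length input."""
    return select(values, (len(values) - 1) // 2)


print(median([7, 1, 5, 3, 9]))
```

Unlike sorting, each round discards part of the data for good, which is
why the expected work is proportional to n rather than n log n.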


3.2 Numerical Techniques


3.2.1 Linear Algebra

It is essential that students understand how the
computations for least squares linear regression
are, or should be, performed. They need to
understand the computational details of Gaussian
elimination and the Cholesky decomposition of
X'X. They need to understand the orthogonal
decomposition techniques: Householder rotations,
the QR decomposition and the singular value
decomposition. In our program these topics are
covered briefly during a first-year graduate course
in mathematical methods for statistics, but it is
worth reviewing and elaborating here. Students
should also understand what is gained and what
lost when the computations are performed on X'X.
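
What is lost by forming X'X can be seen on a tiny example. The sketch
below uses a standard ill-conditioned 3 x 2 matrix with a small
parameter eps (our choice, not taken from the paper). Forming X'X
squares eps and the information is rounded away, whereas orthogonalizing
the columns of X directly keeps it. For brevity the orthogonalization is
modified Gram-Schmidt, standing in for the Householder approach named in
the text.

```python
import math

eps = 1e-8
# Columns of X = [[1, 1], [eps, 0], [0, eps]].
x1 = [1.0, eps, 0.0]
x2 = [1.0, 0.0, eps]


def dot(u, v):
    total = 0.0
    for a, b in zip(u, v):
        total += a * b
    return total


# Normal-equations route: the exact determinant of X'X is
# 2*eps**2 + eps**4, which is positive, but 1 + eps**2 rounds to 1.
xtx = [[dot(x1, x1), dot(x1, x2)],
       [dot(x2, x1), dot(x2, x2)]]
det_xtx = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]

# Orthogonalization route: work on X itself, never forming X'X.
r11 = math.sqrt(dot(x1, x1))
q1 = [a / r11 for a in x1]
r12 = dot(q1, x2)
resid = [b - r12 * a for a, b in zip(q1, x2)]
r22 = math.sqrt(dot(resid, resid))   # second diagonal entry of R

print(det_xtx)   # computed X'X is exactly singular
print(r22)       # close to sqrt(2) * eps, clearly nonzero
```

A Cholesky factorization of the computed X'X would break down at the
second pivot, while the triangular factor obtained from X has a usable
second diagonal entry.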

There is a variety of other topics we cover in less
detail, including eigenvector-eigenvalue methods
such as the symmetric QR method, condition
numbers and computational accuracy, ANOVA
calculations for orthogonal designs and conjugate
gradient techniques for non-orthogonal designs.


3.2.2 Optimization

Anioiiu  cornputaiionai  protilerns  of  dfiplied 

statistics,  optimization  is  ut)inuitous.  thr  most 

cofTimon  application  Doing  maximum  likphnoort 

estimation.  An  excellent  lecent  text  on  nonlinear 
optimization  that  we  have  used  is  Dertnis  and 
Schnaliel  (1'^83';  see  Kass  1198^'  foi  an  extended 
review.  Hgn-qualily  Fortran  programs  are  also 

avaiiaPie.  e.g.,  in  MINPACK  and  IMSL. 

We  believe  that  students  need  to  understand  Doth 
the  theoretical  and  computational  issues  in 
optimizanon.  We  consider  Newton  arid  Newlon-like 
methods,  and  the  simplex  method  (Neldei  and 
Mead,  1965'  for  general  minimization  funblems.  and 
the  Gauss-Newton  method  and  derivai»ve-free  least- 
squares  le.q..  Ralston  and  Jennrecfi,  1978*  for 
nonlinear  least -srtuare^ .  Wc  expect  Pie  Mudrni^  to 
learn  trie  Paste  analysis  including  thf>  corwcMgencr- 
and  rates  ot  convergence  arqumentr.  they  sitouid 
also  ufulr^rsiand  the  stopping  rules.  in  aiJdition.  we 
discus'  some  ideas  for  dealing  win  consuamis 
We  ilevote  most  of  our  efforts  ic*  Newton  5. 
methc  d  and  ns  variants. 

In our teaching experience we have found it quite
worthwhile to go over various Newton-like methods
in the one-dimensional case, and require the
students to write programs implementing each of
the techniques discussed. It is important for
students to understand the motivation for the use
of Newton-like methods, as well as to gain some
idea of the options available and their possible
pitfalls. Furthermore, the use of difference
quotients in place of derivatives opens the door to
the study of secant methods, which form the basis
of most available high-quality general-purpose
code. There is clever and sometimes elegant
mathematics involved, and this class of methods
has received substantial attention in recent years.
An aspect of secant methods that is extremely
important for statistics is that the approximations
to the Hessian deteriorate. We make sure the
students appreciate that the final Hessian in the
output of most quasi-Newton programs should not
be trusted.
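
The one-dimensional exercise described above might look like the following minimal Python sketch. It is an illustration only, not the authors' course material: the exponential-rate likelihood, the data values, the starting points, and the function names `newton` and `secant` are all assumptions chosen so that the answer can be checked against the closed-form estimate n / sum(x). The secant routine replaces the derivative of the score with a difference quotient, as the text describes.

```python
def newton(score, hessian, theta, tol=1e-10, max_iter=50):
    """Newton's method for a root of the score (derivative of the log-likelihood)."""
    for _ in range(max_iter):
        step = score(theta) / hessian(theta)
        theta -= step
        # Relative stopping rule on the size of the last step.
        if abs(step) < tol * (1.0 + abs(theta)):
            return theta
    raise RuntimeError("Newton iteration did not converge")


def secant(score, theta0, theta1, tol=1e-10, max_iter=100):
    """Secant method: the derivative is replaced by a difference quotient."""
    s0, s1 = score(theta0), score(theta1)
    for _ in range(max_iter):
        slope = (s1 - s0) / (theta1 - theta0)  # difference quotient
        theta0, s0 = theta1, s1
        theta1 = theta1 - s1 / slope
        s1 = score(theta1)
        if abs(theta1 - theta0) < tol * (1.0 + abs(theta1)):
            return theta1
    raise RuntimeError("secant iteration did not converge")


# Hypothetical data, assumed to be exponential with unknown rate lam.
data = [0.5, 1.2, 0.3, 2.0, 1.0]
n, total = len(data), sum(data)


def score(lam):
    # d/d(lam) of the log-likelihood n*log(lam) - lam*sum(x)
    return n / lam - total


def hessian(lam):
    # second derivative of the log-likelihood
    return -n / lam ** 2


newton_mle = newton(score, hessian, 0.5)
secant_mle = secant(score, 0.5, 0.6)
closed_form = n / total
```

Both iterations should agree with `closed_form` to within the tolerance. The sketch does not demonstrate the deterioration of Hessian approximations mentioned in the text, which concerns multi-parameter quasi-Newton updates.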

We spend some time on linear programming
and constrained optimization problems. We focus
on Lagrange multipliers and the Kuhn-Tucker
conditions. In addition, we discuss some of the
discrete optimization problems, a topic that we
believe will become increasingly important for
statistics.


3.2.3 Approximation of Functions

Approximation of functions is already familiar to
statistics students in the form of L_p approximation.
The choices p = 1 and 2 are most familiar, but for
many computational purposes, p = ∞ is more
appropriate. Orthogonal polynomials are also
familiar, but discussion of their use in computing
will present them in a manner that many students
will not have seen. This is a crucial part of their
mathematical training, as well, so it should not be
skipped. Also fundamental is an introduction to the
theory of rational function approximation. In
addition, we include discussion of interpolation.
Although trigonometric approximation and the Fast
Fourier Transform may be covered in other
courses, its importance makes inclusion of it here
highly desirable. Some discussion of
approximation by splines is also useful.
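
A small Python sketch may help connect L_p approximation to quantities statistics students already know. It assumes the simplest possible setting, approximating a list of numbers by a single constant, and it assumes that the third norm mentioned in the text is the maximum (p = ∞) norm, which is a reading of a damaged passage in the source. The function names are hypothetical.

```python
import math
import statistics


def best_constant(values, p):
    """Constant c minimizing the L_p distance between c and the values."""
    if p == 1:
        return statistics.median(values)       # least absolute deviations
    if p == 2:
        return statistics.mean(values)         # least squares
    if p == math.inf:
        return (min(values) + max(values)) / 2  # minimax: the midrange
    raise ValueError("only p = 1, 2, or infinity are handled in this sketch")


def max_error(values, c):
    """L-infinity error of approximating every value by the constant c."""
    return max(abs(v - c) for v in values)


values = [1.0, 2.0, 3.0, 4.0, 10.0]
c1 = best_constant(values, 1)
c2 = best_constant(values, 2)
c_inf = best_constant(values, math.inf)
```

For these values the three fits are the median, the mean, and the midrange respectively; the midrange has the smallest worst-case error, which is the property usually wanted when a function approximation must be accurate everywhere on an interval.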


Students should be aware of the popular quadrature
methods for evaluating definite integrals. It is
important to discuss both one-step use of
quadrature formulas (e.g., for Gaussian or Newton-
Cotes quadrature), having well-known error bounds,
and also adaptive quadrature methods, such as
the IMSL routine DCADRE. The stopping rule for
adaptive quadrature is crucial, since one can
introduce errors into the solution by using stopping
rules based only on the change in approximations
achieved at successive iterations. (See Bohrer and
Schervish, 1981, and Schervish, 1984, for examples.)
The errors become particularly troublesome in
multivariate integrals. Students should also be
exposed to the Monte Carlo methods of
integration, including importance sampling.
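
The danger of a stopping rule based only on the change between successive approximations can be shown with a short Python sketch. This is not DCADRE or any IMSL routine; it is a deliberately naive interval-doubling trapezoid rule, and the narrow-spike integrand, its location, and the tolerance are assumptions chosen to make the failure visible.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n equal subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def naive_doubling(f, a, b, tol=1e-6, max_doublings=20):
    """Double n until two successive estimates agree to within tol."""
    n = 1
    prev = trapezoid(f, a, b, n)
    for _ in range(max_doublings):
        n *= 2
        cur = trapezoid(f, a, b, n)
        if abs(cur - prev) < tol:  # stopping rule: change in the estimate only
            return cur, n
        prev = cur
    return prev, n


def spike(x):
    # Narrow Gaussian-shaped spike centered at 0.3 with width 0.001.
    return math.exp(-((x - 0.3) / 0.001) ** 2)


estimate, n_used = naive_doubling(spike, 0.0, 1.0)
# Integral over the whole real line; the mass outside [0, 1] is negligible.
true_value = 0.001 * math.sqrt(math.pi)
```

The first two grids (points 0, 1 and 0, 0.5, 1) both miss the spike, so the two estimates agree and the rule stops at once with an estimate of essentially zero, although the true value is about 0.00177.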


3.3. Computer Intensive Methods


3.3.1  Graphics 

Much interesting recent research in statistical
computing involves graphics. There is certainly
room in a course such as this for extensive
discussion of statistical graphics, including
methods such as projection pursuit, but we have
not yet emphasized this area within our version of
the course. At the least, students need some
awareness of the ongoing efforts and the existing
methods for displaying multivariate data as
described, for example, in Gnanadesikan (1977) and
Chambers et al. (1983).


3.3.2 Random Numbers and Simulation

Simulation methods are often used to evaluate the
accuracy of asymptotic approximations; in some
cases where analytical results are not available a
simulation is the only available technique. Since
random numbers from a given distribution may be
generated from a sequence of uniformly distributed
random numbers, the basic problem is the
generation of uniformly distributed random
numbers. Standard methods include the linear
congruential method, the feedback shift register
method, and combination methods such as the idea
of MacLaren and Marsaglia (1965) to use one sequence
of random numbers to shuffle another. Once a
sequence of uniformly distributed random numbers
has been generated, observations from arbitrary
distributions may be derived by various techniques,
including use of the inverse distribution function
and acceptance-rejection methods.
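
The two steps named above, a uniform generator followed by a transformation, can be sketched in a few lines of Python. The multiplier 16807 and modulus 2^31 - 1 are one widely published choice for a multiplicative congruential generator, not necessarily the one used in the authors' course; the class name `Lcg`, the exponential target distribution, and the rate are assumptions made for illustration.

```python
import math


class Lcg:
    """Multiplicative linear congruential generator: x <- a * x mod m."""

    def __init__(self, seed=1, a=16807, m=2 ** 31 - 1):
        if not 0 < seed < m:
            raise ValueError("seed must lie strictly between 0 and m")
        self.state, self.a, self.m = seed, a, m

    def next_uniform(self):
        self.state = (self.a * self.state) % self.m
        return self.state / self.m  # value strictly between 0 and 1


def exponential_draw(gen, rate):
    """Inverse distribution function method: solve 1 - exp(-rate * x) = u."""
    u = gen.next_uniform()
    return -math.log(1.0 - u) / rate


gen = Lcg(seed=1)
gen.next_uniform()
first_state = gen.state
gen.next_uniform()
second_state = gen.state

sample = [exponential_draw(Lcg_gen, 2.0) for Lcg_gen in [Lcg(seed=12345)] for _ in range(20000)]
sample_mean = sum(sample) / len(sample)  # should be near 1 / rate = 0.5
```

A generator this simple is adequate for a classroom demonstration only; the shuffling idea of MacLaren and Marsaglia cited in the text is one way to improve on a single congruential sequence.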


The usual goal of a Monte Carlo experiment is to
estimate the mean or some other functional of the
sampling distribution of a statistic. Various
techniques for variance reduction are used,
including increased sample size, use of antithetic
variables, and stratification.
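
As a hedged illustration of one of these techniques, the Python sketch below compares a plain Monte Carlo estimate with an antithetic-variable estimate of E[exp(U)] for U uniform on (0, 1), whose true value is e - 1. The target quantity, the seed, and the sample sizes are assumptions; both estimators use the same number of function evaluations.

```python
import math
import random
import statistics


def plain_estimate(rng, n):
    """Mean of n independent draws, with the estimated variance of that mean."""
    draws = [math.exp(rng.random()) for _ in range(n)]
    return statistics.mean(draws), statistics.variance(draws) / n


def antithetic_estimate(rng, n_pairs):
    """Average exp(u) and exp(1 - u) within each pair before averaging pairs."""
    pair_means = []
    for _ in range(n_pairs):
        u = rng.random()
        pair_means.append(0.5 * (math.exp(u) + math.exp(1.0 - u)))
    return statistics.mean(pair_means), statistics.variance(pair_means) / n_pairs


rng = random.Random(2024)  # fixed seed so the run is reproducible
plain_mean, plain_var = plain_estimate(rng, 20000)
anti_mean, anti_var = antithetic_estimate(rng, 10000)
true_value = math.e - 1.0
```

Because exp(u) and exp(1 - u) are negatively correlated, the variance of the antithetic estimator should come out far smaller than that of the plain estimator at equal cost.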

Monte Carlo study techniques have other
applications to statistical practice, including the
evaluation of high dimensional integrals, evaluation
of posterior distributions, and bootstrapping.
Students must gain a solid understanding of the
basic elements of this central topic in statistical
computing.


3.4. Concurrent Processing

We believe that the most dramatic change in
computing in the next decade is going to be the
evolution of the various very high-speed
computers. Our students need some appreciation of
this, and we discuss concurrent computation in
several parts of the course. Our detailed
introduction includes description of various
architectures (see, e.g., Schwartz, 1983),
interprocessor communication networks, and a little
material on numerical analysis (see, e.g., Schendel,
1984). We expect that the next iteration of our
course will include some actual hands-on work with
our array processor.


3.5.  Writing  Software 

Bates (1983) reports that completion of a term
project of writing, testing, and documenting a
piece of statistical software gives students a
valuable sense of the requirements of producing
good software. We prefer to have students devote
their time to learning the large amount of material
we cover, but we share with Bates the desire to
impart an appreciation of some of the concepts of
software engineering, such as top-down and
modular design and structured programming
languages, and the variety of useful tools for
software writing, including the subroutine packages
such as IMSL and LINPACK, interactive languages
such as APL, and matrix manipulation languages
such as those in SAS or S. Thus, we integrate
these topics into the course where we can, but do
not devote much time to software writing per se.
The growing field of statistical computing has created the need for students to
obtain a more formal education in the subject. This gives rise to the following
questions. Where does statistical computing fit into the education of statistics
majors? Is there some common statistical computing body of knowledge these students
should receive? How machine oriented should this training be? These topics are
addressed from the perspectives of both undergraduate and graduate study in
statistics. Is it our goal to teach students studying statistical computing a skill
or the theory behind that skill? The answer to this question may be based on the
level of education and the background required of the student before entering a
statistical computing course.


1.  INTRODUCTION 

This  section  of  the  conference  is  about 
the  teaching  of  statistical  computing.  Is 
statistical  computing  sufficiently  important  to 
be  included  in  a  statistics  program?  Rather 
than give my own, perhaps biased, opinion of
the  importance  and  nontrivial  nature  of 
statistical  computing,  I  quote  M.G.  Kendall 
(1972). 

"... bright ideas do not fructify unless we
can bring them to bear on numerical material,
and for many of our outstanding problems, as
we shall see, the computer is necessary."

"...  the  statistician  requires  a  full 
mathematical  armory  to  bring  his  solving 
process  to  the  point  where  the  machine 
can  take  over  if  required." 

Statistical  computing,  unlike  other  areas 
of  specialization  within  the  discipline  of 
statistics,  has  an  ambiguous  connotation.  A 
popular  notion  about  someone  trained  in 
statistical  computing  is  that  they  are  simply 
very  clever  in  manipulating  statistical 
software  packages.  This  is  neither  the  goal 
nor  the  outcome  of  a  statistical  computing 
education. 

One  way  to  remove  this  ambiguous 
connotation  is  for  those  of  us  in  the  field  to 
establish  what  major  topics  should  be  included 
in  statistical  computing  courses.  It  is  clear 
in  what  course  a  student  will  learn  about 
stratified  random  sampling  and  ratio 
estimators.  It  is  not  evident  in  what  course, 
if  any,  a  student  will  learn  about  random 
number  generation,  sweep  operators,  and 
numerical  stability. 

This  paper  outlines  topics  that  ought  to 
be  included  in  statistical  computing  courses. 
Statistical computing training for both
graduate and undergraduate students is
discussed.  Suggestions  are  made  regarding 
where  these  courses  fit  into  the  statistics 
curriculum  and  how  machine  oriented  they  should 
be.  It  is  hoped  that  a  result  of  the  papers 
presented  in  this  section  of  the  conference 
will  be  to  stimulate  discussion  among  those  of 
us  involved  in  statistical  computing  about  the 
issues  mentioned  above. 


2. STATISTICAL COMPUTING TOPICS

Two interesting committee reports about
the training of statisticians have appeared
recently in The American Statistician. The
first appeared in May 1980 and was directed at
the training of statisticians for employment in
industry. The second appeared in May 1982 and
dealt with the training of the statistician for
the federal government. As might be expected,
there is considerable overlap in the
recommendations given in these reports.
Computing skills and a knowledge of statistical
computing were indicated to be important by both
reports. The specific recommendations in these
areas fell into four categories.

1.  Knowledge  of  a  scientific  programming 
language. 

2.  Experience  with  several  of  the  most 
popular  statistical  software  packages. 

3.  Experience  with  the  construction  and 
maintenance  of  large  data  base  files. 

4.  Instruction  in  proper  numerical  analysis 
techniques  for  statistical  computations. 

Most  statisticians  would  concur  with 
Kennedy  (1982)  that  Items  1  and  2  should  be  a 
required  part  of  every  statistics  student's 
education.  Kennedy  also  points  out  that  the 
experience of Item 3 is frequently attained
through involvement in consulting. For
students specializing in statistical computing,
a special effort should be made to acquire this
experience in data base management. To fulfill
Item 4 the student would need to complete one
or two statistical computing courses.

There appear to be mixed feelings within
the statistical computing community as to
whether a statistical computing course should
be a requirement or an elective for the
statistics major. In any case, statistics
majors should gain an awareness of what general
topics are considered to be in the field of
statistical computing from their overall
statistics education. One purpose of the text
Statistical Computing by Kennedy and Gentle
(1980) was to present, in one place, material
that is central to the area of statistical
computing. A brief outline of the topics
included in their book is as follows.

1. Introduction to the history and literature
of statistical computing.

2. Computer hardware operating
characteristics.

3. Computer software and programming
considerations for package design.

4. Floating-point arithmetic and an
introduction to error analysis.

5. Random number generation, testing, and
an introduction to general simulation
methodology.

6. Approximating probabilities, percentiles
and other special functions.

7. Numerical methods in linear algebra with
emphasis on methods most useful in
statistics.

8. Linear least squares computations
including model building and solutions
under constraints.

9. Nonlinear least squares computations for
unconstrained and constrained problems.

10. Computational methods for alternatives to
least squares - robust methods.

A partial overlap with the material listed
here can be found in Computational Methods for
Data Analysis by Chambers (1977). An
additional topic included in Chambers's text is
graphical procedures. Another interesting book
on the subject of statistical computing is
Statistical Computation by Maindonald (1984)
which deals extensively with Topics 7 to 10 in
the outline. It is appropriate, in this
author's opinion, to include all of the topics
listed above as well as some graphical
procedures in the battery of statistical
computing courses which is offered.


3. UNDERGRADUATE PROGRAM

A distinction has not yet been made in
this  paper  between  undergraduate  and  graduate 
education  in  statistical  computing.  In 
general,  the  difference  between  undergraduate 
and  graduate  study  in  any  area  of 
specialization  is  usually  the  amount  and  depth 
of  the  material  covered.  The  basic  content  of 
the  material  remains  largely  the  same.  There 
is  no  reason  for  statistical  computing  to  be 
handled  differently. 

At  present,  there  are  several  recurring 
themes  in  undergraduate  statistical  computing 
courses.  These  are  data  structures,  data  base 
management,  and  the  use  of  statistical 
packages.  This  may  be  due  to  the  lack  of 
appropriate  prerequisites  for  a  statistical 
computing  course  such  as  calculus  and 
undergraduate  mathematical  statistics,  thus 
making it difficult to consider many of the
topics listed in Section 2. Data structures
and data base management are some of the ACM
(Association for Computing Machinery)
curriculum recommendations for computer science.
Thus,  students  could  probably  acquire  expertise 
in  these  areas  by  taking  a  course(s)  to  be 
found  among  the  university's  computer  science 
offerings.  If  statistical  package  experience 
other  than  what  is  obtained  in  the  required 
statistics  courses  is  needed,  then  perhaps  a 
specific  statistics  package  course  should  be 
offered.  To  avoid  unnecessary  confusion  with 
respect  to  the  field  of  statistical  computing, 
it  is  suggested,  by  this  author,  that  courses 
of  the  nature  just  discussed  be  titled 
something  other  than  statistical  computing. 

With  the  prerequisites  of  calculus, 
probability  theory,  and  some  computer 
programming,  a  first  course  in  statistical 
computing  for  the  undergraduate  statistics 
major  could  include  Topics  1  to  6  from  Section 
2  and  some  graphical  procedures.  This  set  of 
material  does  not  require  a  sophisticated 
background  in  either  linear  algebra  or  linear 
models.  It  would  be  very  easy  for  such  a 
course  to  turn  into  a  general  numerical 
analysis  class.  When  teaching  statistical 
computing,  care  must  be  taken  to  emphasize 
which  numerical  methods  are  important  to  the 
statistician  and  why  they  are  important.  A 
second  course  in  statistical  computing  for  the 
undergraduate  student  is  probably  not 
necessary.  The  student  may  benefit  more  from 
an  additional  mathematics  class  or  exposure  to 
another  area  of  specialization  within 
statistics. 


4.  GRADUATE  PROGRAM 

To  study  statistical  computing  at  the 
graduate  level,  prerequisite  knowledge  of  a 
scientific  programming  language,  statistical 
theory,  and  statistical  methods  is  needed.  For 
a first semester graduate course in statistical
computing, Kennedy (1982) recommends covering
all the material listed in Section 2. This
results in breadth but not depth of coverage.
Kennedy suggests an advanced selected topics
course be offered for those students interested
in specializing in statistical computing. The
success of the statistical computing program at
Iowa State University indicates that Kennedy's
plan  works  well  in  both  exposing  the  statistics 
graduate  student  to  statistical  computing  and 
in  preparing  students  to  carry  out  statistical 
computing  research. 

An  alternative  to  the  program  described  by 
Kennedy  (1982)  might  be  to  offer  two 
nonsequential  graduate  statistical  computing 
courses.  One  would  cover  Topics  1  to  6  and  a 
second would cover Topics 7 to 10 from Section
2. This would provide a more in-depth coverage
of the subjects. Courses of this nature could
also allow time for inclusion of extra
statistical computing material of special
interest to the professor. Although greater
depth of coverage and a choice of topics may
better suit the needs of the student who
completes only one course in statistical
computing, this program may fall short of
meeting all the needs of students wishing to
specialize in statistical computing.


5. MACHINE ORIENTATION

As  is  indicated  by  the  title  of  this 
paper,  this  author  believes  the  numerical 
analysis  aspect  of  statistical  computing  should 
be emphasized more than the computer
programming aspect. It is, however, important
for students to understand the constraints
imposed on the numerical methods by the
computer.  This  knowledge  is  the  key  to 
improving  their  computer  programming  skills. 

The  computer  can  be  effectively  used  to 
reinforce  the  algorithm  construction  and  the 
numerical  analysis  necessary  to  bring  a 
statistical  concept  from  its  theoretical  form 
to  its  computer  approximation.  Too  much 
computing  tends  to  pull  the  emphasis  of  the 
statistical  computing  course  away  from  the 
statistical  and  numerical  analysis  issues  and 
towards  computer  science  and  computer 
programming  problems. 

To  some  degree,  the  amount  of  computing 
will  depend  on  the  background  of  the  students 
in  the  class.  Also,  more  computing  would  tend 
to  be  included  in  an  undergraduate  statistical 
computing  course  than  in  a  graduate  course. 
Direct  exposure  to  some  of  the  statistical 
libraries  would  be  beneficial  to  the 
undergraduate  student.  The  graduate  student 
will  usually  become  familiar  with  these 
libraries  through  other  statistics  classes  or 
in  their  consulting  experience. 

It is more important that a student who
has studied statistical computing be able to
determine if the statistical needs are being
satisfied by a given algorithm than to be a
top notch scientific programmer. The
statistical computing student should, however,
gain enough background in a statistical
computing course to communicate effectively
with the computer scientist about stability of
algorithms and programming considerations which
optimize computer resources.

Perhaps  due  to  the  increased  interest  in 
general  computational  methods,  numerical  linear 
algebra courses and database management courses
are  more  readily  available  in  mathematics 
and/or  computer  science  departments.  Also, 
students  are  exposed  sooner  and  in  more  depth 
to  the  statistical  software  packages  in  the 
standard  statistics  core  classes.  Students 
completing  the  courses  mentioned  above  learn  to 
handle  canned  routines  and  learn  to  do  some 
scientific programming. Using these courses as
prerequisites to the statistical computing
course(s) would make it possible to concentrate
on numerical methods in statistics with less
emphasis on both pure numerical analysis and
computer programming. The software packages
cannot possibly keep up with all the new
statistical methodology or incorporate every
possible twist in the more common methods.
Consequently, it is desirable to educate
students in statistical computing in such a way
that when confronted with a statistical
analysis problem they will not be constrained to
those methods that are available in existing
computer software.


6. CONCLUDING REMARKS

Minton (1983) discusses the establishment
of statistics as a discipline. The criteria
given in Minton's paper for the visibility and
recognition of a discipline can also be applied
to visibility and recognition of an area of
specialization within a discipline. The
criteria are

1.  A  theory  and  body  of  literature; 

2.  A  significant  number  of  professionals 
working  in  the  field; 

3.  More  than  a  few  professional  journals 
regularly  publishing  new  advances  in 
the  subject; 

4. A significant market demand for its
services. 

The  field  of  statistical  computing  is 
moving  towards  fulfilling  all  of  these 
criteria.  To  expedite  this  effort  it  would  be 
helpful  to  define  more  clearly  the  statistical 
computing  body  of  knowledge.  This  can  be 
accomplished  through  our  statistical  computing 
course  offerings  and  through  the  exposure  we 
give  students  to  statistical  computing  in  the 
other  statistics  classes. 
ANIMATING STATISTICAL ALGORITHMS


Marc  H.  Brown 

Department  of  Computer  Science 
Brown  University 
Providence, RI 02912

High-performance  graphics-based  workstations  have  made  possible  a  quantum 
leap  forward  in  the  quality  of  tools  available  for  teaching  and  studying  statistical 
algorithms.  For  example,  the  Department  of  Computer  Science  at  Brown  University 
has  a  specially  designed  auditorium/lecture-hall  containing  60  such  workstations, 
interconnected by a high-bandwidth resource-sharing local area network (LAN).
Rather than using a chalkboard or viewgraphs, instructors are able to use dynamic
simulations  of  algorithms  being  taught.  Students  are  able  to  interact  with  these 
real-time  animations  in  order  to  gain  better  insight  into  their  operational 
characteristics.  Students  are  transformed  from  passive  listeners  to  active 
participants  in  the  learning  process. 

In  this  talk,  we  will  describe  the  software  environment  we  have  developed  for 
animating algorithms. Typically, animations contain multiple views of the data.
As the algorithm progresses, all of the views are updated simultaneously. Users
are  able  to  stop  the  animation  at  any  time,  control  the  speed  of  the  animation  and 
even  whether  it  should  run  in  reverse,  single-step  and  set  breakpoints  using  entities 
meaningful  to  the  algorithm  being  animated.  In  addition,  multiple  algorithms  may 
be  run  in  parallel  in  order  to  better  compare  and  contrast  them.  We  will  also  give 
examples  from  the  host  of  computer  science  and  statistics  algorithms  that  we  have 
animated, and show a videotape of some animations.


Discussion  on  Teaching  of  Statistical  Computing 


Richard M. Heiberger


Temple  University 
Department  of  Statistics 
Philadelphia, PA 19122

This discussion comments primarily on software design topics other than the
numerical analysis issues covered by the other speakers. It includes a short
discussion of my attempt to illustrate by counterexample the dictum that Householder
reflection calculations should be based on the numerically optimum reflection angle.


The speakers were consistent in their emphasis
on the fundamental area of numerical analysis.
The major addition I have to the numerical
analysis theme is a recommendation for the new
book by John R. Rice [1] as a major reference
for everyone's library and as an excellent text
for a numerical analysis course. Rice discusses
the derivation of algorithms, programming of
algorithms, and use of published software.
Graphs, examples, subroutines, and problem sets
are in abundance. Pathological cases are
carefully treated. There are several chapters
on design and use of program libraries. The
book includes the ACM index of all algorithms
published from 1960-1980 in 1? major journals
and a detailed index to the IMSL Library
Subroutines.

In my course I also place a strong emphasis on
issues of design of programming systems and
packages. Not only do I discuss individual
algorithms, I also place them in the context of
a package. Therefore I discuss communication
among subroutines, design of overlay structures,
and design of user-friendly user interfaces. I
stress the importance of adhering strictly to
professional programming standards to make
long-term maintenance of a system possible. One
of my class projects is an assignment to write a
simple subroutine and attach it to an existing
package to take advantage of the user control
language and data management facilities
developed for the package. I have used MINITAB
[2] and SAS [3] for this purpose.

I also find it helpful to explore the boundaries
of a problem. For example, while discussing the
Householder reflection, I decided to verify that
the sign of the reflection really made the
important difference to numerical stability that
is claimed for it. It does, of course, but it
was initially difficult to construct a case
where choosing the wrong sign caused
cancellation of significant digits.

I found two conditions were necessary for an
example to display numerical difficulties. The
two defining vectors, the ones that are to be
reflected onto a constant times the direction of
the other, must be nearly linearly dependent and
the computations must use single precision
accumulation of inner products. Only with that
combination of conditions was I able to use the
non-optimal sign to create a "reflection" matrix
that did not reflect the two vectors onto each
other. Neither near dependence nor single
precision accumulation by itself was enough to
make the wrong sign give trouble. Only when
instability was present in both the data and in
the computational process was there an incorrect
calculation. This example reinforces two
conclusions. One, proper computational
paranoia, such as always automatically using
double precision accumulation, can protect you
from some potential problems. Two, well-posed
problems with stable data can lead to correct
computations even if there are instabilities in
the algorithm.
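
The effect of the sign choice can be seen in a small Python sketch. This is not a reproduction of the counterexample described above, which involved two nearly dependent vectors and single precision accumulation. It is an assumed, more extreme configuration: a vector reflected onto the first coordinate axis, with a second component so small that the near dependence falls below double precision resolution. The function name and the test vector are hypothetical.

```python
import math


def householder_apply(x, stable=True):
    """Apply H = I - 2 v v' / (v'v) to x, where v = x + s * ||x|| * e1.

    With stable=True, s has the sign of x[0], so no cancellation occurs
    in v[0]. With stable=False the opposite sign is used deliberately.
    """
    norm = math.sqrt(sum(xi * xi for xi in x))
    sign = 1.0 if x[0] >= 0 else -1.0
    if not stable:
        sign = -sign
    v = list(x)
    v[0] = x[0] + sign * norm          # cancellation happens here if unstable
    vtv = sum(vi * vi for vi in v)
    vtx = sum(vi * xi for vi, xi in zip(v, x))
    scale = 2.0 * vtx / vtv
    return [xi - scale * vi for xi, vi in zip(x, v)]


x = [1.0, 1e-9]                        # almost parallel to the first axis
stable_result = householder_apply(x, stable=True)
unstable_result = householder_apply(x, stable=False)
```

Either reflection should map x onto the first axis, leaving a zero second component. With the stable sign that is what happens; with the other sign the first component of v cancels to exactly zero and the second component of the result is not annihilated at all.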


[1] Rice, John R., Numerical Methods, Software,
and Analysis: IMSL(r) Reference Edition
(McGraw-Hill, New York, 1983)

[2] Ryan, Thomas A., Jr., Brian L. Joiner, and
Barbara F. Ryan, MINITAB Reference Manual
(MINITAB Project, University Park, PA, 1982)

[3] SAS Programmers Guide (SAS Institute, Inc.,
Cary, NC, 1982)
A MONTE CARLO STUDY OF PARALLELISM TESTS FOR COMPLETE AND INCOMPLETE GROWTH CURVE DATA

Neil  C.  Schwertman,  Sallysue  Stein,  William  Flynn,  Kathryn  L.  Schefik 
Monte Carlo simulations using a broad spectrum of dispersion structures are used to
compare for significance level and power tests for the parallelism of the response
curves for both complete and incomplete data.

The methods used are the split-plot, Hotelling’s T-square, analysis of the estimated
regression coefficients for each subject, successive differences, and estimation of
missing data. For complete data the split-plot analysis using the Geisser-Greenhouse
correction and Hotelling’s T-square on the estimated regression coefficients for each
subject were best. For incomplete data the split-plot analysis using the Geisser-
Greenhouse correction from the smoothed dispersion matrix was most satisfactory.


I.  Introduction 

Frequently in biological, medical, agricultural
and clinical studies measurements are taken on
the same experimental unit over time. Data
from such studies, called growth curve, repeated
measure or longitudinal data, is characterized
by large correlations between the observations
on the same experimental unit. Such data
is properly analyzed using multivariate analysis
procedures. However, when the data has missing
observations the usual multivariate methodology
does not adapt readily. Kleinbaum (1975)
proposed a multivariate procedure that accommodates
incomplete data by generalizing the Potthoff
and Roy (1964) growth curve model.
Schwertman (1974) in a very small simulation
study and Leeper and Woolson (1982) in a much
more extensive study show that while Kleinbaum’s
generalized growth curve model has excellent
large sample properties, the simulated significance
levels are much too large for small data
sets.

Schwertman, Fridshal and Magrey (1981) suggested
a nonparametric multivariate approach
to the analysis of both complete and incomplete
growth curve data which was quite satisfactory
with regard to significance level but did not
seem to have much power.

Since the multivariate approaches to the analysis
of incomplete growth curve data have some
difficulties, various univariate approaches for
the analysis of such data have been suggested.

A common univariate analysis of growth curve
data is the split-plot design with time periods
as the subplot treatment. This analysis readily
adapts to missing data but the analysis depends
on the dispersion structure of the data vectors
on each subject. Box (1954) investigated the
effects dispersion structures have on the F
statistic and Geisser and Greenhouse (1958)
proposed an adjustment to the degrees of freedom
of the F statistic to account for the dispersion
structure. Huynh and Feldt (1970) established
that the necessary and sufficient condition for
no adjustment to the degrees of freedom is that
the data vector for each experimental unit have
the dispersion structure

$$\Sigma_{p \times p} = \sigma^2 I_{p \times p} + \alpha J_{p \times p} + [\, a_{p \times 1}\, j'_{1 \times p} + j_{p \times 1}\, a'_{1 \times p} \,]$$

where $\sigma^2$ and $\alpha$ are scalars, $p$ is the number of time periods at which observations are taken, $J$ is a $p \times p$ matrix of ones, $j$ is a $p$ vector of ones and $a$ is a vector of constants such that $a'j = 0$. Schwertman (1978) extended the Huynh and Feldt result by showing that this dispersion structure is sufficient for incomplete data sets as well as complete ones, and that no adjustment to degrees of freedom is required in either case. Schwertman, Fridshal and Magrey (1981) use a small simulation study to show that the split-plot method is not satisfactory for the analysis of growth curve data, particularly incomplete data sets, which do not have that particular dispersion structure. Collier et al. (1967) did an extensive Monte Carlo study of the use of the Geisser-Greenhouse correction. Their study assumed the dispersion structure is known and uses the structure to calculate the correction factor. In this paper, methods of testing which use the Geisser-Greenhouse correction assume that the dispersion matrix estimated from the data is used to calculate the correction factor.
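The role of this dispersion structure can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the illustrative values of $\sigma^2$, $\alpha$ and $a$ are hypothetical and not taken from the paper. It builds $\Sigma$ from the formula above and confirms that any set of orthonormal contrasts among the time periods has dispersion $\sigma^2 I$, which is the property that lets the split-plot F test go unadjusted.

```python
import numpy as np

def orthonormal_contrasts(p):
    """Rows form an orthonormal basis of the vectors orthogonal to the ones vector."""
    basis = np.column_stack([np.ones(p), np.eye(p)[:, : p - 1]])
    q, _ = np.linalg.qr(basis)      # first column of q is proportional to ones
    return q[:, 1:].T               # (p-1) x p contrast matrix

def huynh_feldt_dispersion(sigma2, alpha, a):
    """Sigma = sigma2*I + alpha*J + a j' + j a' (a is assumed to sum to zero)."""
    a = np.asarray(a, dtype=float)
    p = a.size
    j = np.ones(p)
    return (sigma2 * np.eye(p) + alpha * np.outer(j, j)
            + np.outer(a, j) + np.outer(j, a))

# Illustrative (hypothetical) values; a sums to zero as the text requires.
a = np.array([0.3, -0.1, -0.4, 0.2])
sigma = huynh_feldt_dispersion(1.5, 0.4, a)
C = orthonormal_contrasts(4)
contrast_cov = C @ sigma @ C.T      # equals 1.5 * I up to rounding error
```

Because every contrast annihilates the ones vector, the terms involving $J$ and $j$ drop out and only $\sigma^2 I$ remains.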

The purpose of this paper is to compare various alternatives for the analysis of complete and incomplete growth curve data with regard to significance level and power. In this paper the split-plot analysis for both complete and incomplete data is simulated and compared to a successive difference procedure suggested by C.R. Rao (1958) and Hill (1968). Besides the split-plot and successive difference procedures, the use of estimated data for the missing data and the summarizing of data in the regression coefficients for each experimental unit are also compared.

The various alternative testing procedures are described in detail in section II and illustrated with a sample data set in section III.

The Monte Carlo simulation study is described in section IV and the conclusions are contained in section V.

II.  Description  of  the  Testing  Procedures 
Simulated 

The following methods of analysis are compared for both significance level and power in the Monte Carlo study of complete growth curve data.

(c-1) Split-Plot Analysis with time periods as the subplot treatment. The model used is

$$y_{ijk} = \mu + T_i + S_{j(i)} + P_k + TP_{ik} + e_{ijk} \qquad (1)$$

where $i = 1,\ldots,t$, $j = 1,\ldots,n$, $k = 1,\ldots,p$, $T_i$ is the treatment effect, $S_{j(i)}$ is the subject effect ($S_{j(i)} \sim N(0, \sigma_s^2)$), $P_k$ is the period effect, $TP_{ik}$ is the interaction between the Treatment and Period and $e_{ijk} \sim N(0, \sigma^2)$.

Usually the primary interest is in the parallelism of the growth curves, that is the $TP_{ik}$ effect. The split-plot analysis is:

              Source               Degrees of freedom
  Whole Plot  Treatment            t-1
              Error(a): Subject    t(n-1)
  Sub Plot    Period               (p-1)
              tmt x period         (t-1)(p-1)
              Error(b): error      t(n-1)(p-1)

For the Monte Carlo study $i = 1,\ldots,t = 2$; $j = 1,\ldots,n = 5$ (or 10); $k = 1,\ldots,p = 4$ or 8. (When $p = 8$ in the Monte Carlo study, a fifth degree polynomial was assumed adequate to describe the growth curve and hence the degrees of freedom for interaction was 5 instead of $(t-1)(p-1) = (1)(7)$. The sum of squares for the higher order terms was pooled with the error sum of squares.)

(c-2) The split-plot analysis in (c-1) with the degrees of freedom adjusted with the Geisser-Greenhouse correction computed on the estimated dispersion matrix calculated from the data.

(c-3) Successive Difference Analysis. The change in the response is measured by subtracting each observation from the subsequent observation. That is, for the model described in (1), $y_{ijk} - y_{ij(k-1)}$ is calculated for all $i$, $j$ and $k = 2,\ldots,p$. These differences are reparameterized such that

$$\Delta^{(i)}_{k-1,k} = E\left(y_{ijk} - y_{ij(k-1)}\right)$$

where $\Delta^{(i)}_{k-1,k}$ is the change in the response for the $i$th treatment group from time $k-1$ to time $k$. The test for parallelism of the growth curves becomes

$$H_0:\ \Delta^{(1)}_{k-1,k} = \Delta^{(2)}_{k-1,k} = \cdots = \Delta^{(t)}_{k-1,k} \quad \text{for } k = 2,\ldots,p. \qquad (2)$$

This hypothesis can be tested by creating an $nt(p-1)$ vector of the differences, say $Y_d$, and forming a $(p-1)t \times 1$ vector of parameters $\Delta$ consisting of the $\Delta^{(i)}_{k-1,k}$ in order $k = 2,\ldots,p$ in sequence $i = 1, 2, \ldots, t$. Then $E(Y_d) = X\Delta$ where $X$ is an $nt(p-1) \times t(p-1)$ design matrix containing $t$ submatrices of ones and zeros, each $n(p-1) \times (p-1)$, on the main diagonal, and all submatrices off the main diagonal are zero. The $i$th submatrix on the diagonal corresponds to the $i$th treatment group and the $(\ell, \ell')$ element of $X$ is

$$X(\ell,\ell') = \begin{cases} 1 & \text{if } \ell' = (i-1)(p-1) + \ell \bmod (p-1), \text{ where } i \text{ is the treatment group of the } \ell\text{th element of } Y_d \\ 0 & \text{otherwise.} \end{cases}$$

The first $(p-1)$ columns of $X$ correspond to the $(p-1)$ differences from treatment group 1 ($i = 1$), the second $(p-1)$ columns correspond to the $(p-1)$ differences from treatment group 2 ($i = 2$), etc. To test $H_0$ given in (2) a restricted design matrix, $X_p$, and parameter vector, $\Delta_p$, are used. The design matrix $X_p$ is an $nt(p-1) \times (p-1)$ matrix consisting of ones and zeros such that the $(\ell, \ell')$ element of $X_p$ is

$$X_p(\ell,\ell') = \begin{cases} 1 & \text{if } \ell' = \ell \bmod (p-1) \\ 0 & \text{otherwise.} \end{cases}$$

The parameter vector $\Delta_p$ is a $(p-1)$ vector of change parameters such that the $\ell$th element of $\Delta_p$ represents the change from time period $\ell$ to time period $\ell + 1$ for all treatment groups. Then the null hypothesis given in (2) is tested by fitting the full model $E(Y_d) = X\Delta$ and the model restricted by the null hypothesis, $E(Y_d) = X_p\Delta_p$, and calculating a sum of squares for regression for each. Then the test statistic is

$$F = \frac{Y_d'\left[X(X'X)^{-1}X' - X_p(X_p'X_p)^{-1}X_p'\right]Y_d \,/\, (t-1)(p-1)}{Y_d'\left[I - X(X'X)^{-1}X'\right]Y_d \,/\, t(n-1)(p-1)}$$
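The construction in (c-3) can be sketched in a few lines of code. This is an illustrative sketch only, assuming NumPy and complete data stored as an array of shape (t, n, p); the function names are hypothetical. It orders the differences from the earliest to the latest period, whereas the worked example in section III lists them from the latest to the earliest, which permutes the rows and columns but leaves the F statistic unchanged.

```python
import numpy as np

def regression_ss(X, y):
    """Uncorrected sum of squares for regression, y'X(X'X)^{-1}X'y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return float(fitted @ fitted)

def successive_difference_f(Y):
    """Y has shape (t, n, p): treatment group, subject, time period."""
    t, n, p = Y.shape
    d = np.diff(Y, axis=2).reshape(-1)          # stacked successive differences
    eye = np.eye(p - 1)
    # Full model: one set of p-1 change parameters per treatment group.
    X_full = np.kron(np.eye(t), np.kron(np.ones((n, 1)), eye))
    # Restricted model: a single set of p-1 change parameters for all groups.
    X_restricted = np.kron(np.ones((t * n, 1)), eye)
    ssr_f = regression_ss(X_full, d)
    ssr_r = regression_ss(X_restricted, d)
    sse = float(d @ d) - ssr_f
    df_num = (t - 1) * (p - 1)
    df_den = t * (n - 1) * (p - 1)
    f = (ssr_f - ssr_r) / df_num / (sse / df_den)
    return f, df_num, df_den

# Hypothetical check: two groups with exactly parallel curves give F = 0.
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 4))
f_parallel, df1, df2 = successive_difference_f(np.stack([base, base + 2.0]))
```

Shifting one group by a constant leaves every successive difference unchanged, so the full and restricted fits coincide and the numerator vanishes.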


(c-4) This analysis uses the Successive Difference statistic calculated in (c-3) but incorporates the Geisser-Greenhouse correction given by $\epsilon = 2(p^2 - 1)/(3p^2 - 4p + 2)$ (see Schwertman and Heilbrun (1984)) to adjust degrees of freedom.


(c-5)  Hotelling's  T-Square  statistic  using  the 
complete  Multivariate  data. 


(c-6) Hotelling's T-Square statistic using the estimated regression coefficients as data. For each experimental unit the data are summarized by estimating the regression coefficients for a quadratic growth curve. Since interest is primarily in parallelism of the growth curves, only the coefficients of the linear and quadratic terms in time are used as bivariate data for the Hotelling's T-Square statistic.


The following methods of analysis are compared for both significance level and power in the Monte Carlo study of incomplete growth curve data.


(I-1) The split-plot analysis similar to that described by (c-1) except that the degrees of freedom for error is $N - nt - pt + t$ where $N$ is the total number of observations in the entire data set.


(I-2) The split-plot analysis in (I-1) with the degrees of freedom adjusted with the Geisser-Greenhouse correction computed using the estimated dispersion matrix calculated from the incomplete data.

(I-3) Split-plot analysis in (I-1) with the degrees of freedom adjusted with the Geisser-Greenhouse correction computed using the estimated dispersion matrix from the incomplete data, which is smoothed if necessary. (See Schwertman and Allen (1979) and Huseby, Schwertman, and Allen (1980).)

(I-4) The Successive Difference Analysis described in (c-3) using the incomplete data. If one or more observations are missing at time periods between observations then the successive difference will estimate the sum of more than one $\Delta$ parameter; with two consecutive observations missing, for instance, it estimates $\Delta^{(i)}_{k,k+1} + \Delta^{(i)}_{k+1,k+2} + \Delta^{(i)}_{k+2,k+3}$. A similar procedure is used for more than two observations missing between observations. The model given in (c-3) is the same; however, the $X$ and $X_p$ matrices must be adjusted to correspond to the differences in $Y_d$ and now may contain more than just a single one in each row, as is the case with complete data. If the element in $Y_d$ is a difference calculated with one missing value in between then the corresponding row in both $X$ and $X_p$ will have exactly two consecutive ones in it. If there were two missing values then that row of $X$ and $X_p$ would have three consecutive ones in it and so forth. The test statistic is calculated in the same manner.
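The rule for the rows of the design matrix with missing observations can be illustrated as follows. This is a small sketch assuming NumPy; the helper name `difference_rows` is hypothetical. Given the periods actually observed for one subject, it returns one row per successive difference, with consecutive ones covering every change parameter spanned by that difference.

```python
import numpy as np

def difference_rows(observed_periods, p):
    """One row of ones and zeros per successive difference for one subject.

    observed_periods: sorted, 1-based period numbers that were observed.
    p: total number of time periods, so each row has p-1 entries.
    """
    rows = []
    for earlier, later in zip(observed_periods[:-1], observed_periods[1:]):
        row = np.zeros(p - 1, dtype=int)
        # The difference spans the changes earlier->earlier+1, ..., later-1->later.
        row[earlier - 1 : later - 1] = 1
        rows.append(row)
    return np.array(rows)

# A subject observed at periods 1, 5, 6 and 7 out of 7 (periods 2-4 missing).
rows = difference_rows([1, 5, 6, 7], p=7)
```

Here the first difference skips three missing periods and so carries four consecutive ones, while the remaining two rows carry a single one each. The same rows serve for the restricted matrix; for the full matrix they are placed in the block of columns belonging to the subject's treatment group.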


(I-5) The Successive Difference Analysis described in (I-4) with the degrees of freedom adjusted for the dispersion structure using $\epsilon = 2(p^2 - 1)/(3p^2 - 4p + 2)$ (see Schwertman and Heilbrun (1984)) and using the maximum number of observations per experimental unit (the most conservative $\epsilon$) as $p$.


(I-6) The Successive Difference Analysis described in (I-4) with the degrees of freedom adjusted for the dispersion structure using $\epsilon = 2(p^2 - 1)/(3p^2 - 4p + 2)$ (see Schwertman and Heilbrun (1984)) and using the average number of observations per experimental unit as $p$.

(I-7) The Hotelling's T-Square using the estimated regression coefficients, described in (c-6), as data.


(I-8) The Split-Plot Analysis using the incomplete data with estimates inserted for missing observations. The estimates of missing values are obtained by determining the estimated regression coefficients for a quadratic growth curve and using this equation to fill in the missing observations. The split-plot is analyzed as described in (c-1) with the degrees of freedom for error adjusted by subtracting out a degree of freedom for every missing observation estimated.


For example, in the successive difference analysis (I-4), if observations $y_{ijk}$ and $y_{ij(k+2)}$ are observed but $y_{ij(k+1)}$ is missing, the difference $y_{ij(k+2)} - y_{ijk}$ is used to estimate the sum of two $\Delta$ parameters, $\Delta^{(i)}_{k,k+1} + \Delta^{(i)}_{k+1,k+2}$. Similarly if $y_{ijk}$ and $y_{ij(k+3)}$ are observed but $y_{ij(k+1)}$ and $y_{ij(k+2)}$ are missing then $y_{ij(k+3)} - y_{ijk}$ is used to estimate the sum of three $\Delta$ parameters.

(I-9) The Split-Plot Analysis described in (I-8) using the degrees of freedom further adjusted with the Geisser-Greenhouse correction.

(I-10) The Successive Difference Analysis described in (c-3) using the data with estimates of the missing values included and the degrees of freedom for error adjusted by subtracting a degree of freedom for error for every missing value estimated and then applying the Geisser-Greenhouse correction.

(I-11) The Hotelling's T-Square analysis using the data with the estimates of missing values included.

III.  An  Example 

To  illustrate  the  various  analysis  procedures 
compared  in  the  Monte  Carlo  study,  consider  a 
portion  of  the  Grizzle  and  Allen  (1969)  data  for 
the  Coronary  Sinus  Potassium  levels  of  dogs.  The 
data  is: 

Control (Group I)

                       Time Periods
  Dog     1      2      3      4      5      6      7
   1     4.0    4.0*   4.1*   3.6*   3.6    3.8    3.1
   2     4.2    4.3*   3.7    3.7*   4.8    5.0    5.2
   3     4.3*   4.2    4.3    4.3    4.5    5.8    5.4*
   4     4.2    4.4*   4.6    4.9*   5.3    5.6*   4.9
   5     4.6*   4.4*   5.3    5.6    5.9*   5.9*   5.3
   6     3.1    3.6    4.9    5.2    5.3    4.2    4.1
   7     3.7    3.9    3.9    4.8    5.2    5.4    4.2
   8     4.3    4.2    4.4    5.2    5.6    5.4    4.7
   9     4.6    4.6    4.4    4.6    5.4    5.9    5.6

Treated (Group IV)

                       Time Periods
  Dog     1      2      3      4      5      6      7
   1     3.1*   3.5    3.5    3.2*   3.0    3.0*   3.2
   2     3.3    3.2    3.6    3.7*   3.7*   4.2    4.4*
   3     3.5    3.9*   4.7    4.3    3.9    3.4    3.5
   4     3.4    3.4    3.5    3.3*   3.4    3.2    3.4
   5     3.7    3.8    4.2    4.3    3.6*   3.8    3.7
   6     4.0*   4.6    4.8    4.9    5.4    5.6    4.8
   7     4.2    3.9*   4.5    4.7    3.9    3.8    3.7
   8     4.1    4.1    3.7    4.0    4.1    4.6    4.7*
   9     3.5    3.6    3.6    4.2    4.8    4.9    5.0

(Values marked with an asterisk were deleted for the incomplete data analysis to simulate the missing data.)

For the split-plot analysis, (c-1), the data are analyzed using regression and the full model

$$y_{ijk} = \mu + T_i + S_{j(i)} + P_k + TP_{ik} + e_{ijk}$$

where $i = 1,2$; $j = 1,2,\ldots,9$ and $k = 1,2,\ldots,7$, $T_i$ is the treatment effect, $S_{j(i)}$ is the subject effect within the $i$th treatment group, $P_k$ is the period effect, $TP_{ik}$ is the interaction between treatment and period and $e_{ijk}$ and $S_{j(i)}$ are random components.

The regression is done by stacking the observation vectors on each subject, and the design matrix $X$ consists of elements $X(\ell,m)$ defined as follows:

  $X(\ell,1) = 1$ for every observation.

  $X(\ell,2) = -1$ if the $\ell$th observation in $y$ is from the control group; 1 otherwise.

  $X(\ell,1+j) = 1$ if the $\ell$th observation in $y$ is from the $j$th subject in the control group, $j = 2,\ldots,9$; 0 otherwise.

  $X(\ell,9+j) = 1$ if the $\ell$th observation in $y$ is from the $j$th subject in the treated group, $j = 2,\ldots,9$; 0 otherwise.

  $X(\ell,18+g)$ = for the $\ell$th observation in $y$, the appropriate orthogonal polynomial coefficient for the observation period using the Anderson and Housman (1941) tables ($g = 1$ are the linear coefficients, $g = 2$ are quadratic, $g = 3$ cubic, etc.; $g = 1, 2, \ldots, 5$).

  $X(\ell,23+g) = X(\ell,2) \cdot X(\ell,18+g)$; these are the interaction columns in $X$.

To test that $TP_{ik} = 0$ for every $i,k$ the statistic is $F = (SSR_f - SSR_r)/(df \cdot MSE)$ where $SSR_f$ and $SSR_r$ are the sums of squares for regression for the full and restricted model respectively, $MSE$ is the mean square error, and $df = 5$ (a fifth degree polynomial was assumed adequate to describe the response over time). Then $F = (2359.81752 - 2357.04539)/(5(2377.92 - 2359.81752)/98) = 3.0014$. This statistic is compared to the critical values for an F with degrees of freedom 5, 98.
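The arithmetic of this test can be reproduced directly from the three sums of squares. The sketch below is illustrative; the function name and its argument names are hypothetical, and the total sum of squares is taken to be the 2377.92 appearing in the expression above.

```python
def f_from_regression_ss(ssr_full, ssr_restricted, total_ss, df_num, df_error):
    """F = (SSR_f - SSR_r) / (df * MSE), with MSE = (total - SSR_f) / df_error."""
    mse = (total_ss - ssr_full) / df_error
    return (ssr_full - ssr_restricted) / (df_num * mse)

# Complete-data split-plot example, using the sums of squares quoted in the text.
f_complete = f_from_regression_ss(2359.81752, 2357.04539, 2377.92,
                                  df_num=5, df_error=98)
```

The result agrees with the reported value of 3.0014 to about three decimal places; the small discrepancy in the last digit presumably reflects rounding of the printed sums of squares.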

For the split-plot analysis using the Geisser-Greenhouse correction (c-2) the estimated dispersion matrix is needed. The pooled estimated dispersion matrix for the data is

  S =  .1832  .1269  .0647  .0799  .0995  .2053  .1694
              .1332  .0876  .0909  .1324  .1885  .1390
                     .2668  .2652  .1889  .1003  .0372
                            .4014  .3619  .2672  .1570
                                   .4997  .4492  .3560
                                          .6499  .5340
       Symmetric                                 .5511

and the Geisser-Greenhouse correction is $\epsilon = .4243652$. Thus the $F = 3.0014$ is compared to the critical values of an F with degrees of freedom $\epsilon \cdot 5$, $\epsilon \cdot 98$ or 2, 42 to the nearest integer value.
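For readers who want to compute a correction of this kind from an estimated dispersion matrix, the following is a minimal sketch of the usual Box/Geisser-Greenhouse $\epsilon$ based on all $p-1$ orthonormal contrasts, assuming NumPy. The function name is hypothetical, and the sketch is not claimed to reproduce the value .4243652 above: the paper does not state which contrasts were used (for example, only the five polynomial contrasts retained in the model), so that detail is an open assumption here.

```python
import numpy as np

def greenhouse_geisser_epsilon(S):
    """Box's epsilon for a p x p dispersion matrix S, using p-1 orthonormal contrasts."""
    S = np.asarray(S, dtype=float)
    p = S.shape[0]
    basis = np.column_stack([np.ones(p), np.eye(p)[:, : p - 1]])
    q, _ = np.linalg.qr(basis)
    C = q[:, 1:].T                      # orthonormal contrasts, C @ ones = 0
    V = C @ S @ C.T                     # dispersion of the contrasts
    return np.trace(V) ** 2 / ((p - 1) * np.trace(V @ V))

# Under compound symmetry the correction is exactly 1 (no adjustment needed).
compound_symmetric = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))
eps_cs = greenhouse_geisser_epsilon(compound_symmetric)
```

The value always lies between $1/(p-1)$ and 1, so multiplying both degrees of freedom by it can only make the test more conservative.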

For the successive difference analysis (c-3) the F statistic is calculated using a similar procedure as described for the split-plot analysis with the full model consisting of twelve $\Delta$ parameters, six for each treatment group, and a restricted model consisting of only one set of six $\Delta$ parameters for both groups. Using the complete coronary sinus potassium data the vector of differences, $Y_d$; the design matrix, $X$; the parameter vector, $\Delta$; and the restricted design matrix, $X_p$, and corresponding parameter vector are given below.

  Y_d (108x1)    X (108x12)                 X_p (108x6)
  3.1-3.8        0 0 0 0 0 1 0 0 0 0 0 0    0 0 0 0 0 1
  3.8-3.6        0 0 0 0 1 0 0 0 0 0 0 0    0 0 0 0 1 0
  3.6-3.6        0 0 0 1 0 0 0 0 0 0 0 0    0 0 0 1 0 0
  3.6-4.1        0 0 1 0 0 0 0 0 0 0 0 0    0 0 1 0 0 0
  4.1-4.0        0 1 0 0 0 0 0 0 0 0 0 0    0 1 0 0 0 0
  4.0-4.0        1 0 0 0 0 0 0 0 0 0 0 0    1 0 0 0 0 0
  5.2-5.0        0 0 0 0 0 1 0 0 0 0 0 0    0 0 0 0 0 1
    ...            ...                        ...
  4.4-4.6        0 1 0 0 0 0 0 0 0 0 0 0    0 1 0 0 0 0
  4.6-4.6        1 0 0 0 0 0 0 0 0 0 0 0    1 0 0 0 0 0
  3.2-3.0        0 0 0 0 0 0 0 0 0 0 0 1    0 0 0 0 0 1
  3.0-3.0        0 0 0 0 0 0 0 0 0 0 1 0    0 0 0 0 1 0
  3.0-3.2        0 0 0 0 0 0 0 0 0 1 0 0    0 0 0 1 0 0
    ...            ...                        ...
  4.2-3.6        0 0 0 0 0 0 0 0 1 0 0 0    0 0 1 0 0 0
  3.6-3.6        0 0 0 0 0 0 0 1 0 0 0 0    0 1 0 0 0 0
  3.6-3.5        0 0 0 0 0 0 1 0 0 0 0 0    1 0 0 0 0 0

$$\Delta_{12 \times 1} = \left(\Delta^{(1)}_{1,2},\ \Delta^{(1)}_{2,3},\ \ldots,\ \Delta^{(1)}_{6,7},\ \Delta^{(2)}_{1,2},\ \Delta^{(2)}_{2,3},\ \ldots,\ \Delta^{(2)}_{6,7}\right)', \qquad \Delta_{p\,(6 \times 1)} = \left(\Delta_{1,2},\ \Delta_{2,3},\ \ldots,\ \Delta_{6,7}\right)'$$

Then $F = (SSR_f - SSR_r)/6/MSE = (5.8655 - 3.6372)/6/(([illegible] - 5.8655)/96) = 2.6615$. This statistic is compared to the critical value of an F with degrees of freedom 6, 96.

For the successive difference analysis using the Geisser-Greenhouse correction (c-4) the same F statistic is computed but the degrees of freedom are adjusted by multiplying by

$$\epsilon = \frac{2(p^2 - 1)}{3p^2 - 4p + 2} = \frac{2(48)}{3(49) - 28 + 2} = .7934 .$$

The F is then compared to the critical value of an F with degrees of freedom 5, 76.

For the Hotelling's T-Square analysis on the data (c-5), the $T^2 = 29.9141422$ and the test statistic is

$$F = \frac{N_1 + N_2 - p}{(N_1 + N_2 - 2)(p - 1)}\, T^2 = \frac{11}{(16)(6)}\,(29.9141422) = 3.4277 .$$

This statistic is compared to the critical value of the F with degrees of freedom 6, 11. (See Morrison (1967), page 145.)
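The conversion from $T^2$ to F can be written as a small helper. This is a sketch with a hypothetical function name; `q` denotes the number of variables entering the T-square (six successive changes for the seven-period data here, so $q = p - 1$), which makes the expression above a special case of the usual two-sample formula.

```python
def hotelling_f(t2, n1, n2, q):
    """Convert a two-sample Hotelling T-square on q variables to an F statistic.

    Returns (F, numerator df, denominator df).
    """
    df1 = q
    df2 = n1 + n2 - q - 1
    f = df2 / ((n1 + n2 - 2) * q) * t2
    return f, df1, df2

# Complete-data example from the text: nine dogs per group, q = 6.
f_c5, df1_c5, df2_c5 = hotelling_f(29.9141422, 9, 9, 6)
```

With these inputs the helper gives degrees of freedom 6 and 11 and an F that rounds to the 3.4277 reported above.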

For the Hotelling's T-Square analysis on the data summarized in the estimated regression coefficients (c-6) ($\beta_0$, $\beta_1$, $\beta_2$ in the model $y = \beta_0 + \beta_1 t + \beta_2 t^2 + e$) the following are the estimated regression coefficients for each subject used in the analysis.

               Control                         Treated (Group 4)
  Dog    b0        b1        b2          b0        b1        b2
   1   3.97143    .06190   -.02381     3.27143    .03333   -.00952
   2   4.58571   -.44167    .07976     3.21429    .02143    .02143
   3   4.42857   -.22738    .05833     3.00000    .69762   -.09524
   4   3.55714    .54762   -.04524     3.45714   -.02738    .00119
   5   3.54286    .82262   -.07738     3.44286    .32143   -.04286
   6   1.48571   1.63095   -.18333     3.27143    .76905   -.07381
   7   2.61429    .87381   -.08333     3.78571    .34643   -.05357
   8   3.31429    .72381   -.06905     4.38571   -.32381    .05476
   9   4.52857   -.06905    .03810     3.11429    .24881    .00595

Since it is desired to test parallelism of the response curves over time, only the $\hat\beta_1$ and $\hat\beta_2$ estimates are used. The $T^2$ for the data summarized in $\hat\beta_1$, $\hat\beta_2$ is 3.673376 and the test statistic is

$$F = \frac{N_1 + N_2 - p - 1}{(N_1 + N_2 - 2)\,p}\, T^2 = \frac{15}{(16)(2)}\,(3.673376) = 1.7219 .$$

This statistic is compared to the critical value of the F with degrees of freedom 2, 15 (see Morrison (1967), page 125).
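The per-subject summary in (c-6) is an ordinary quadratic least-squares fit in time. As a sketch, assuming NumPy and taking the time periods to be coded 1 through 7 (the coding that reproduces the tabulated coefficients), the fit for dog 6 of the control group is:

```python
import numpy as np

t = np.arange(1, 8)                                   # time periods 1..7
y = np.array([3.1, 3.6, 4.9, 5.2, 5.3, 4.2, 4.1])    # control group, dog 6

# np.polyfit returns coefficients from the highest power down.
b2, b1, b0 = np.polyfit(t, y, deg=2)
```

Only `b1` and `b2` for each subject would then be carried forward as the bivariate observations for the T-square; the intercept `b0` plays no part in the parallelism test.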


For the split-plot analysis of the incomplete data (I-1) the same models are used as for the complete data, but with the missing observations omitted. The F statistic for testing no interaction is

$F = (SSR_f - SSR_r)/5/MSE = (1886.01889 - 1883.57116)/5/((1899.5 - 1886.01889)/72) = 2.6146$.

This statistic is compared to the critical value for an F with degrees of freedom 5, 72.


For the split-plot analysis for the incomplete data adjusted with the Geisser-Greenhouse correction (I-2) the estimated dispersion matrix for the incomplete data is:


and the Geisser-Greenhouse correction is $\epsilon = .3256127$. Thus $F = 2.6146$ is compared to the critical value of an F with degrees of freedom 2, 23.

For the split-plot analysis using the Geisser-Greenhouse correction calculated from the smoothed estimate of the dispersion matrix (I-3), the smoothed estimated dispersion matrix is needed. The dispersion matrix given in (I-2) was not positive semidefinite. Using the Schwertman and Allen (1979) smoothing procedure to find the "closest" positive semidefinite matrix to the original estimate, the smoothed estimated dispersion matrix is

  .1918065  .1654627  .0030270  -.0331534  .0176511   .1401976  .1242361
            .2110937  .0936510   .0109829  .1849782   .2619820  .2576760
                      .2797611   .1500375  .1534341  -.0038301  .0466650
                                 .1680153  .1175633  -.0425321  .0112487
                                           .5354160   .4537057  .4972162
                                                      .6469337  .5988344
  Symmetric                                                     .5880248

and the Geisser-Greenhouse correction is $\epsilon = .3468859$. Thus the $F = 2.6146$ is compared to the critical value of an F with degrees of freedom 2, 25.
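The idea of replacing an indefinite estimate by a nearby positive semidefinite matrix can be illustrated generically. The sketch below, assuming NumPy, sets negative eigenvalues to zero, which yields the nearest positive semidefinite matrix in the Frobenius norm. This is a standard construction offered only as an illustration; it is an assumption that it resembles the Schwertman and Allen (1979) procedure, whose details are not given in the text, and the function name is hypothetical.

```python
import numpy as np

def nearest_psd(S):
    """Nearest positive semidefinite matrix to symmetric S (Frobenius norm)."""
    S = np.asarray(S, dtype=float)
    S = (S + S.T) / 2.0                      # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(S)
    clipped = np.clip(eigvals, 0.0, None)    # drop the negative eigenvalues
    return (eigvecs * clipped) @ eigvecs.T

# A small indefinite "dispersion" estimate (eigenvalues 3 and -1).
indefinite = np.array([[1.0, 2.0], [2.0, 1.0]])
smoothed = nearest_psd(indefinite)
```

A matrix that is already positive semidefinite is returned unchanged, so the step is harmless when no smoothing is needed.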


For the successive difference analysis of incomplete data (I-4) the same procedure as described in (c-3) is used. For the incomplete coronary sinus potassium data the vector of differences, $Y_d$; the design matrix, $X$; the parameter vector, $\Delta$; the restricted design matrix, $X_p$; and the corresponding parameter vector are:

  Y_d (82x1)     X (82x12)                  X_p (82x6)
  3.1-3.8        0 0 0 0 0 1 0 0 0 0 0 0    0 0 0 0 0 1
  3.8-3.6        0 0 0 0 1 0 0 0 0 0 0 0    0 0 0 0 1 0
  3.6-4.0        1 1 1 1 0 0 0 0 0 0 0 0    1 1 1 1 0 0
  5.2-5.0        0 0 0 0 0 1 0 0 0 0 0 0    0 0 0 0 0 1
  5.0-4.8        0 0 0 0 1 0 0 0 0 0 0 0    0 0 0 0 1 0
  4.8-3.7        0 0 1 1 0 0 0 0 0 0 0 0    0 0 1 1 0 0
  3.7-4.2        1 1 0 0 0 0 0 0 0 0 0 0    1 1 0 0 0 0
  5.8-4.5        0 0 0 0 1 0 0 0 0 0 0 0    0 0 0 0 1 0
    ...            ...                        ...
  4.4-4.6        0 1 0 0 0 0 0 0 0 0 0 0    0 1 0 0 0 0
  4.6-4.6        1 0 0 0 0 0 0 0 0 0 0 0    1 0 0 0 0 0
  3.2-3.0        0 0 0 0 0 0 0 0 0 0 1 1    0 0 0 0 1 1
  3.0-3.5        0 0 0 0 0 0 0 0 1 1 0 0    0 0 1 1 0 0
  3.5-3.5        0 0 0 0 0 0 0 1 0 0 0 0    0 1 0 0 0 0
  4.2-3.6        0 0 0 0 0 0 0 0 1 1 1 0    0 0 1 1 1 0
    ...            ...                        ...
  3.6-3.6        0 0 0 0 0 0 0 1 0 0 0 0    0 1 0 0 0 0
  3.6-3.5        0 0 0 0 0 0 1 0 0 0 0 0    1 0 0 0 0 0

$$\Delta_{12 \times 1} = \left(\Delta^{(1)}_{1,2},\ \ldots,\ \Delta^{(1)}_{6,7},\ \Delta^{(2)}_{1,2},\ \ldots,\ \Delta^{(2)}_{6,7}\right)', \qquad \Delta_{p\,(6 \times 1)} = \left(\Delta_{1,2},\ \ldots,\ \Delta_{6,7}\right)'$$

Then $F = (SSR_f - SSR_r)/6/MSE = (5.0244 - 2.9624)/6/(14.5356/70) = 1.6374$. This statistic is compared to the critical value of an F with degrees of freedom 6, 70.

For the successive difference analysis using the Geisser-Greenhouse correction (I-5) the correction factor is $\epsilon = 2(p^2 - 1)/(3p^2 - 4p + 2)$ with $p = 7$, $\epsilon = .7934$, and the F statistic calculated in (I-4), $F = 1.6374$, is compared to the critical value of an F with degrees of freedom $6 \cdot \epsilon$, $70 \cdot \epsilon$ or 5, 56 to the nearest integer.

For the successive difference analysis using the Geisser-Greenhouse correction based on the average number of responses per experimental unit (I-6), $p = 100/18 = 5.56$ and $\epsilon = 2[(5.56)^2 - 1]/[3(5.56)^2 - 4(5.56) + 2] = .8252$. The $F = 1.6374$ is compared to the critical value of an F with degrees of freedom 5, 58.
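Both corrections follow from the same closed-form expression, so the arithmetic is easy to check. The helper names below are hypothetical; the rounding of the adjusted degrees of freedom to the nearest integer follows the wording of the text.

```python
def successive_difference_epsilon(p):
    """Correction factor 2(p^2 - 1) / (3p^2 - 4p + 2) for successive differences."""
    return 2 * (p ** 2 - 1) / (3 * p ** 2 - 4 * p + 2)

def adjusted_df(df_num, df_den, eps):
    """Multiply both degrees of freedom by eps and round to the nearest integer."""
    return round(df_num * eps), round(df_den * eps)

eps_max = successive_difference_epsilon(7)       # maximum observations per unit
eps_avg = successive_difference_epsilon(5.56)    # average observations per unit
df_max = adjusted_df(6, 70, eps_max)
df_avg = adjusted_df(6, 70, eps_avg)
```

Using the maximum number of observations gives the smaller $\epsilon$ and therefore the more conservative reference distribution, which is why (I-5) is described as the most conservative choice.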

For the Hotelling's T-square analysis on the data summarized in the estimated regression coefficients (I-7) ($\beta_0$, $\beta_1$, $\beta_2$ in the model $y = \beta_0 + \beta_1 t + \beta_2 t^2 + e$) the following are the estimated regression coefficients for each subject used in the analysis based on the incomplete data set.

               Control                         Treated (Group 4)
  Dog    b0        b1        b2          b0        b1        b2
   1   3.85721    .16874   -.03725     4.28769   -.42977    .03857
   2   4.23614   -.21871    .05442     3.17403    .04503    .02127
   3   5.86000  -1.14571    .18571     3.06857    .68619   -.09524
   4   3.64000    .54000   -.05000     3.41429    .01071   -.00357
   5   3.20000   1.00000   -.10000     3.21571    .49179   -.06179
   6   1.48571   1.63095   -.18333     3.09143    .85071   -.08214
   7   2.61429    .87381   -.08333     4.00429    .31000   -.05357
   8   3.31429    .72381   -.06905     4.57000   -.48250    .08036
   9   4.52857   -.06905    .03810     3.11429    .24881    .00595

Since it is desired to test parallelism of the response curves over time, only the $\hat\beta_1$ and $\hat\beta_2$ estimates are used. The $T^2$ for the data summarized in $\hat\beta_1$, $\hat\beta_2$ is 3.5279527 and the test statistic is

$$F = \frac{N_1 + N_2 - p - 1}{(N_1 + N_2 - 2)\,p}\, T^2 = \frac{15}{(16)(2)}\,(3.5279527) = 1.6537 .$$

This statistic is compared to the critical value of an F with degrees of freedom 2, 15.

The analysis procedures (I-8), (I-9), (I-10), and (I-11) all use estimates for any missing data based on the regression $y = \beta_0 + \beta_1 t + \beta_2 t^2$ for that particular subject. In describing (I-7) the estimates for $\beta_0$, $\beta_1$, $\beta_2$ for each subject were given. These estimates for $\beta_0$, $\beta_1$, $\beta_2$ for each subject are used to estimate any missing values for that subject.
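The fill-in step can be sketched as follows, assuming NumPy, time periods coded 1 through 7, and missing values represented as NaN (a representation chosen here for illustration; the function name is hypothetical). The example uses dog 4 of the control group, observed at periods 1, 3, 5 and 7.

```python
import numpy as np

def fill_missing_quadratic(y):
    """Replace NaN entries with predictions from the subject's own quadratic fit."""
    y = np.asarray(y, dtype=float)
    t = np.arange(1, y.size + 1)
    observed = ~np.isnan(y)
    coeffs = np.polyfit(t[observed], y[observed], deg=2)
    filled = y.copy()
    filled[~observed] = np.polyval(coeffs, t[~observed])
    return filled

# Control group, dog 4: periods 2, 4 and 6 are treated as missing.
dog4 = [4.2, np.nan, 4.6, np.nan, 5.3, np.nan, 4.9]
dog4_filled = fill_missing_quadratic(dog4)
```

The fitted curve for this subject is $3.64 + .54t - .05t^2$, so the three missing periods are filled with 4.52, 5.00 and 5.08, matching the values tabulated for that subject.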

The following is the incomplete data with estimates for any missing observation.

CONTROL
                                        Time Period
  Dog      1          2          3          4          5          6          7
   1      4.0     4.0456763  4.0281596  3.9361419     3.6        3.8        3.1
   2      4.2     4.0164179     3.7     4.2320896     4.8        5.0        5.2
   3   4.9000000     4.2        4.3        4.3        4.5        5.8     6.9400000
   4      4.2     4.5200000     4.6     5.0000000     5.3     5.0800000     4.9
   5   4.1000000  4.8000000     5.3        5.6     5.7000000  5.6000000     5.3
   6      3.1        3.6        4.9        5.2        5.3        4.2        4.1
   7      3.7        3.9        3.9        4.8        5.2        5.4        4.2
   8      4.3        4.2        4.4        5.2        5.6        5.4        4.7
   9      4.6        4.6        4.4        4.6        5.4        5.9        5.6

TREATED (Group 4)
                                        Time Period
  Dog      1          2          3          4          5          6          7
   1   3.8964824     3.5        3.5     3.1856784     3.0     3.0974874     3.2
   2      3.3        3.2        3.6     3.6944751  3.9309392     4.2     4.5314917
   3      3.5     4.0600000     4.7        4.3        3.9        3.4        3.5
   4      3.4        3.4        3.5     3.4000000     3.4        3.2        3.4
   5      3.7        3.8        4.2        4.3     4.1300000     3.8        3.7
   6   3.8600000     4.6        4.8        4.9        5.4        5.6        4.8
   7      4.2     4.4100000     4.5        4.7        3.9        3.8        3.7
   8      4.1        4.1        3.7        4.0        4.1        4.6     5.1300000
   9      3.5        3.6        3.6        4.2        4.8        4.9        5.0

The split-plot analysis with estimates used for the missing data (I-8) is done exactly as if the data were complete, and the degrees of freedom for error is adjusted by subtracting a degree of freedom for each missing value. The test statistic for parallelism is $F = (SSR_f - SSR_r)/(df \cdot MSE) = (2395.87488 - 2393.54174)/(5(2419.06587 - 2395.87488)/72) = 1.4487$. This statistic is compared to the critical value of F with degrees of freedom 5, 72.

For the split-plot analysis with estimates used for the missing data adjusted for the Geisser-Greenhouse correction (I-9) the dispersion matrix is needed. The dispersion matrix is

  S =  .1828  .1464  [illegible]  .0149  -.0152  .1388  .2098
              .2698  [illegible]  .1426   .1338  .1701  .1256
                     [illegible]  .2380   .1844  .0798  .0239
                                  .3082   .3125  .1991  .0868
                                          .4649  .4073  .2900
                                                 .6054  .5920
       Symmetric                                        .8649

and the Geisser-Greenhouse correction factor is $\epsilon = .3945405$. The F statistic $F = 1.4487$ is compared to the critical value of an F with degrees of freedom 2, 28.

The successive difference analysis using the data with estimates for the missing observations (I-10) is performed as described for the complete data successive difference analysis (c-3). The F statistic is $F = (SSR_f - SSR_r)/6/MSE = (3.3419 - 2.1363)/6/((20.8518 - 3.3419)/70) = .8033$. The Geisser-Greenhouse correction is

$$\epsilon = \frac{2(p^2 - 1)}{3p^2 - 4p + 2} = \frac{2(48)}{3(49) - 28 + 2} = .7934 .$$

The F statistic $F = .8033$ is compared to the critical value of the F with degrees of freedom 5, 56.

The Hotelling's T-square analysis using the data with estimates for the missing observations (I-11) is performed as described for the complete data Hotelling's T-square analysis (c-5). The $T^2 = 9.0751902$ and the test statistic is

$$F = \frac{N_1 + N_2 - p}{(N_1 + N_2 - 2)(p - 1)}\, T^2 = \frac{11}{(16)(6)}\,(9.0751902) = 1.0399 .$$

This statistic is compared to the critical value of the F with degrees of freedom 6, 11.


IV. The Monte Carlo Study


To compare the various analysis procedures described in Section II, 500 simulations of 14 dispersion structures were run with sample sizes of n = 10 and n = 20 (either 5 or 10 in each of two treatment groups). The dispersion structures, described in Tables 1 and 2, used in the simulation represent a variety of structures with either four or eight multivariate responses and differing patterns. Two data sets were generated: one where both treatment groups have the same growth curve, to measure significance level, and a second where the growth curves for the two groups are different, in order to compare the power of the testing procedures.*

Uniform random variables were generated using the subroutine RAND in the Control Data Corporation FORTRAN LIBRARY ROUTINES. Generation of the normal random variables was accomplished by the Box and Mueller (1958) transformation. To achieve the desired dispersion structures given in Tables 1 and 2, the Cholesky factorization of the dispersion matrix, $\Sigma$, was used. That is, if $T'T = \Sigma$ where $T$ is a $p \times p$ upper triangular matrix, then the vector of random components for each multivariate response vector is $T'e$ where $e$ are the independent normal variables generated using the Box-Mueller (1958) procedure.
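The generation scheme described above can be sketched in modern terms. This is an illustrative sketch only: it assumes NumPy, substitutes NumPy's uniform generator for the CDC FORTRAN subroutine RAND used in the study, and uses hypothetical function names and an arbitrary illustrative dispersion matrix.

```python
import numpy as np

def box_muller(rng, size):
    """Standard normal variates from uniforms via the Box-Muller transformation."""
    u1 = 1.0 - rng.random(size)          # in (0, 1], so log(u1) is finite
    u2 = rng.random(size)
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

def correlated_errors(rng, sigma, n_vectors):
    """Rows are error vectors T'e with dispersion sigma, where T'T = sigma."""
    T = np.linalg.cholesky(sigma).T      # upper triangular factor
    e = box_muller(rng, (n_vectors, sigma.shape[0]))
    return e @ T                         # each row equals (T'e)'

rng = np.random.default_rng(1)
sigma = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))   # illustrative structure only
errors = correlated_errors(rng, sigma, 200_000)
sample_dispersion = np.cov(errors, rowvar=False)
```

With a large number of simulated vectors the sample dispersion matrix should be close to the target $\Sigma$, which is a convenient sanity check before running any of the testing procedures on simulated data.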

The proportion of non positive semidefinite estimated dispersion matrices encountered in the simulations is tabulated in Table 3, and the simulated significance levels and power are presented in Tables 4 through 9.

V. Conclusions


T-square using the estimated regression coefficients on each subject seemed to be more satisfactory, both with respect to significance levels and power, than using Hotelling's T-square on the original data.

For  the  simulation  of  incomplete  growth  curve 
data  the  most  satisfactory  of  the  procedures 
compared  was  the  Split-Plot  Analysis  using  the 
Ceisser-Greenhouse  correction  factor  calculated 
from  the  smoothed  estimated  dispersion  matrix. 
This  procedure  had  consistently  satisfactory 
simulated  significance  levels  and  relatively 
large  power. 

Wl>ile  both  tlie  split-plot  analysis  with  the 
Geisser-Greenhouse  correction  calculated  on  the 
smoothed  estimated  dispersion  matrix  (1-3)  and 
the  Hotelling's  T-square  analysis  on  the  data 
summarized  in  the  estimated  regression  coeffi¬ 
cients  (1-7)  have  satisfactory  significance 
levels  for  both  sample  sizes  and  both  numbers 
of  multivariate  responses  the  former  had  greater 
power.  Since  the  uncorrected  split  plot  analy¬ 
sis  (1-1)  generally  had  much  larger  simulated 
significance  levels  than  the  nominal  value,  its 
use  is  questionable.  The  split-plot  using  the 
Geisser-Greenhouse  correction  on  the  estimated 
dispersion  matrix  before  smoothing  (1-2)  had 
much  smaller  significance  levels  than  the  nomi¬ 
nal  values  and  was  also  substantially  less 
powerful  ttian  the  same  procedure  using  the 
smoothed  dispersion  matrix. 


'■.rvA 


This  Monte  Carlo  study  using  a  variety  of  multi¬ 
variate  dispersion  structures  suggest  that  for 
complete  growth  curve  data  the  preferred  analy¬ 
sis  procedure  of  those  compared  is  either  the 
split-plot  with  the  Geisser-Greenhouse  correc¬ 
tion  or  Hotelling's  T-square  using  ttie  estimated 
regression  coefficietits  on  each  subject  as  the 
data.  The  split-plot  analysis  with  the  Geisser- 
Greenhouse  correction  seems  to  be  more  satisfac¬ 
tory  for  small  numbers  of  multivariate  responses 
since  it  had  better  power.  However  the  T-square 
was  more  satisfactory  with  respect  to  signifi¬ 
cance  level  for  eight  multivariate  responses  and 
should  be  preferred  in  this  case.  The  Geisser- 
Greenhouse  correction  brought  ttie  inflated  sim¬ 
ulated  significance  levels  for  ttie  split-plot 
analysis  closer  to  the  nominal  values  but  still 
witti  eight  multivariate  responses  ttie  simulated 
significance  levels  tended  to  be  somewtiat  larger 
than  the  nominal  values.  The  successive  differ¬ 
ence  procedures  which  were  developed  for  ttie 
analysis  of  data  observed  at  random  times  did 
not  adapt  well  to  the  complete  data,  generally 
tinvinq  simulated  significance  levels  larger 
tlian  norminal  level.  The  two  multivariate 
procedures  using  Hotelling's  T-square  were 
satisfactory  with  respect  to  ttie  simulated 
significance  levels  but  less  powerful  than 
ttie  split-plot  analyses.  The  Hotelling's 
'  ttie  growtti  curves  used  in  this  study  ore  based 
on  ttie  real  growth  curve  of  approximately  75 
bulls  at  ttie  University  of  Kentucky  Agricultural 
txperiment  Station. 


The  successive  difference  procedures  (1-4,  1-5 
1-6)  developed  for  data  observed  at  random 
times,  again  did  not  adapt  well  to  data  at 
fixed  times.  These  procedures  had  significance 
levels  larger  than  the  nominal  values  especially 
for  eight  multivariate  responses  or  larger  sam¬ 
ple  sizes. 

The  test  procedures  using  the  data  with  esti¬ 
mates  for  the  missing  observations  (1-8,  1-9, 
I-IO,  I-ll)  tended  to  have  much  smaller  simu¬ 
lated  significance  levels  than  the  nominal 
values.  Since  the  split-plot  analysis  using 
the  Geisser-Greenhouse  correction  after  smooth¬ 
ing  (1-5)  generally  was  more  powerful  than  any 
of  these  procedures,  it  seems  to  be  more  appro¬ 
priate  for  the  incomplete  data  analysis  than 
either  procedures  1-8,  1-9,  I-IO,  or  1-11. 
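
Since the Geisser-Greenhouse correction factor figures so heavily in these conclusions, a short sketch of how it can be computed from a dispersion matrix may help. The formula used is the standard one, epsilon = [tr(MΣM)]^2 / ((p - 1) tr[(MΣM)^2]) with M the centering matrix I - 11'/p; it is assumed, not confirmed, that this matches the definition in Section II, and the name `gg_epsilon` is hypothetical.

```python
import numpy as np


def gg_epsilon(sigma):
    """Geisser-Greenhouse (Box) epsilon for a p x p dispersion matrix."""
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    m = np.eye(p) - np.full((p, p), 1.0 / p)   # projects out the mean across responses
    s = m @ sigma @ m
    return np.trace(s) ** 2 / ((p - 1) * np.trace(s @ s))


if __name__ == "__main__":
    # Structure 4A of Table I: compound symmetry, so no correction is needed.
    sigma_4a = np.full((4, 4), 0.8) + 0.2 * np.eye(4)
    print(gg_epsilon(sigma_4a))
```

Epsilon lies between 1/(p - 1) and 1; the split-plot F test multiplies both degrees of freedom by it, so values well below 1 make the corrected test more conservative.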


TABLE I

Dispersion structures for four multivariate response simulations
(correlations ρij in parentheses)

Structure  σ11  σ22  σ33  σ44   σ12 (ρ12)       σ13 (ρ13)        σ14 (ρ14)      σ23 (ρ23)       σ24 (ρ24)        σ34 (ρ34)

4A         1.0  1.0  1.0  1.0   .8 (.8)         .8 (.8)          .8 (.8)        .8 (.8)         .8 (.8)          .8 (.8)
4B         1.0  1.0  1.0  1.0   .2 (.2)         .2 (.2)          .2 (.2)        .2 (.2)         .2 (.2)          .2 (.2)
4C         1.0  1.0  1.0  1.0   .8 (.8)         .6 (.6)          .4 (.4)        .8 (.8)         .6 (.6)          .8 (.8)
4D         1.0  2.0  3.0  4.0   1.13137 (.8)    1.38564 (.8)     1.6 (.8)       1.95959 (.8)    2.26274 (.8)     2.77128 (.8)
4E         1.0  2.0  3.0  4.0   .28284 (.2)     .34641 (.2)      .4 (.2)        .48990 (.2)     .56569 (.2)      .69282 (.2)
4F         1.0  2.0  3.0  4.0   1.13137 (.8)    1.03923 (.6)     .8 (.4)        1.95959 (.8)    1.69706 (.6)     2.77128 (.8)
4G         1.0  2.0  3.0  4.0   1.27279 (.9)    1.40296 (.81)    1.458 (.729)   2.20454 (.9)    2.29103 (.81)    3.11769 (.9)
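
Each covariance in Table I is tied to its correlation by σij = ρij (σii σjj)^(1/2). The sketch below rebuilds structure 4F from its variances and correlations as a check on the tabulated entries; the helper name `dispersion_matrix` is hypothetical.

```python
import numpy as np


def dispersion_matrix(variances, corr):
    """Build a dispersion matrix from variances and a correlation matrix."""
    sd = np.sqrt(np.asarray(variances, dtype=float))
    return np.asarray(corr, dtype=float) * np.outer(sd, sd)


# Structure 4F: variances 1, 2, 3, 4; correlation .8, .6, .4 at lags 1, 2, 3.
lags = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
corr_4f = 1.0 - 0.2 * lags
sigma_4f = dispersion_matrix([1.0, 2.0, 3.0, 4.0], corr_4f)

if __name__ == "__main__":
    print(np.round(sigma_4f, 5))
```

The off-diagonal entries reproduce the 4F row of the table to five decimals.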
TABLE II

Dispersion structures for eight multivariate responses simulations
(correlations ρij in parentheses)

Structure  σ11  σ22  σ33  σ44  σ55  σ66  σ77  σ88   σ12 (ρ12)

8A         1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0   .9 (.9)
8B         1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0   .2 (.2)
8C         1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0   .8 (.8)
8D         1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0   1.13137 (.8)
8E         1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0   .28284 (.2)
8F         1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0   1.13137 (.8)
8G         1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0   1.27279 (.9)

Structure  σ13       σ14       σ15       σ16       σ17       σ18       σ23       σ24       σ25
           (ρ13)     (ρ14)     (ρ15)     (ρ16)     (ρ17)     (ρ18)     (ρ23)     (ρ24)     (ρ25)

8A         .9        .9        .9        .9        .9        .9        .9        .9        .9
           (.9)      (.9)      (.9)      (.9)      (.9)      (.9)      (.9)      (.9)      (.9)
8B         .2        .2        .2        .2        .2        .2        .2        .2        .2
           (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)
8C         .6        .4        .2        0.        -.2       -.4       .8        .6        .4
           (.6)      (.4)      (.2)      (0.)      (-.2)     (-.4)     (.8)      (.6)      (.4)
8D         1.38654   1.6       1.78885   1.95959   2.1166    2.26274   1.95959   2.26274   2.52982
           (.8)      (.8)      (.8)      (.8)      (.8)      (.8)      (.8)      (.8)      (.8)
8E         .34641    .4        .44721    .48990    .52915    .56569    .48990    .56569    .63246
           (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)
8F         1.03923   .8        .44721    0.        -.52915   -1.13137  1.95959   1.69706   1.26491
           (.6)      (.4)      (.2)      (0.)      (-.2)     (-.4)     (.8)      (.6)      (.4)
8G         1.40296   1.458     1.46708   1.44642   1.35833   1.21142   2.20454   2.29103   2.3053
           (.81)     (.729)    (.6561)   (.5905)   (.5314)   (.4283)   (.9)      (.81)     (.729)

Structure  σ26       σ27       σ28       σ34       σ35       σ36       σ37       σ38       σ45
           (ρ26)     (ρ27)     (ρ28)     (ρ34)     (ρ35)     (ρ36)     (ρ37)     (ρ38)     (ρ45)

8A         .9        .9        .9        .9        .9        .9        .9        .9        .9
           (.9)      (.9)      (.9)      (.9)      (.9)      (.9)      (.9)      (.9)      (.9)
8B         .2        .2        .2        .2        .2        .2        .2        .2        .2
           (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)
8C         .2        0.        -.2       .8        .6        .4        .2        0.        .8
           (.2)      (0.)      (-.2)     (.8)      (.6)      (.4)      (.2)      (0.)      (.8)
8D         2.77128   2.99333   3.2       2.77128   3.09839   3.39411   3.66606   3.91918   3.57771
           (.8)      (.8)      (.8)      (.8)      (.8)      (.8)      (.8)      (.8)      (.8)
8E         .69282    .74833    .8        .69282    .77460    .84853    .91652    .97980    .89443
           (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)      (.2)
8F         .69282    0.        -.8       2.77128   2.32379   1.69706   .91652    0.        3.57771
           (.2)      (0.)      (-.2)     (.8)      (.6)      (.4)      (.2)      (0.)      (.8)
8G         2.27280   2.20945   2.1256    3.11769   3.13712   3.09289   3.00663   2.89285   4.02492
           (.6561)   (.5905)   (.5314)   (.9)      (.81)     (.729)    (.6561)   (.5905)   (.9)

TABLE III

Proportion of Estimated Dispersion
Matrices from the incomplete data that were NOT positive definite.


Sample
size      4A    4B    4C    4D    4E    4F    4G    8A    8B    8C    8D    8E    8F    8G

n = 10   .930  .040  .456  .578  .040  .448  .692  .852  .008  .206  .570  .006  .218  .556

n = 20   .702  .000  .104  .176  .000  .120  .434  .596  .000  .004  .042  .000  .002  .106

TABLE IV

MONTE CARLO SIMULATION OF SIGNIFICANCE LEVEL FOR INCOMPLETE DATA
SAMPLE SIZE N = 10

                                         Dispersion Structure
Test   Nominal α   4A    4B    4C    4D    4E    4F    4G    8A    8B    8C    8D    8E    8F    8G

I-1      .10      .088  .076  .116  .110  .110  .126  .122  .092  .094  .216  .162  .108  .202  .220
         .05      .026  .044  .068  .060  .048  .084  .076  .038  .030  .142  .098  .052  .166  .158
         .01      .008  .010  .020  .020  .012  .032  .034  .008  .006  .060  .036  .018  .068  .084

I-2      .10      .018  .044  .058  .040  .048  .066  .060  .008  .020  .072  .032  .032  .080  .082
         .05      .008  .020  .018  .014  .022  .024  .024  .000  .004  .022  .002  .008  .028  .022
         .01      .000  .002  .000  .000  .004  .002  .000  .000  .000  .004  .000  .002  .002  .002

I-3      .10      .030  .052  .070  .056  .056  .076  .076  .020  .028  .100  .062  .044  .114  .118
         .05      .008  .022  .028  .022  .022  .034  .036  .006  .006  .052  .024  .012  .050  .064
         .01      .000  .002  .002  .006  .006  .006  .006  .000  .000  .006  .002  .002  .010  .012

I-4      .10      .092  .100  .096  .126  .120  .126  .114  .156  .148  .118  .202  .214  .136  .140
         .05      .056  .050  .046  .050  .056  .068  .070  .106  .114  .066  .104  .114  .074  .084
         .01      .012  .012  .010  .012  .018  .014  .014  .026  .028  .010  .036  .040  .022  .028

I-5      .10      .092  .090  .092  .124  .110  .124  .114  .140  .136  .102  .178  .106  .122  .112
         .05      .052  .050  .044  .050  .054  .064  .066  .078  .084  .052  .086  .088  .068  .072
         .01      .012  .012  .008  .012  .010  .014  .012  .012  .016  .004  .030  .034  .014  .024

I-6      .10      .092  .100  .096  .126  .120  .126  .114  .140  .138  .106  .182  .190  .126  .118
         .05      .056  .050  .046  .050  .056  .068  .070  .080  .088  .052  .088  .090  .068  .072
         .01      .012  .012  .010  .012  .010  .014  .014  .012  .016  .004  .032  .034  .014  .024

I-7      .10      .068  .066  .077  .070  .068  .082  .072  .082  .102  .098  .104  .098  .098  .096
         .05      .028  .036  .030  .036  .032  .032  .040  .034  .046  .040  .052  .054  .056  .048
         .01      .002  .004  .004  .004  .004  .000  .004  .004  .014  .010  .006  .012  .012  .012

I-8      .10      .030  .032  .060  .044  .046  .060  .068  .016  .012  .086  .054  .024  .092  .098
         .05      .012  .020  .030  .030  .014  .034  .036  .008  .004  .054  .024  .012  .040  .062
         .01      .000  .000  .004  .004  .000  .006  .010  .000  .000  .018  .008  .004  .022  .024

I-9      .10      .014  .018  .028  .030  .016  .036  .034  .008  .002  .020  .018  .010  .024  .038
         .05      .000  .000  .010  .010  .002  .014  .016  .000  .000  .008  .004  .002  .012  .010
         .01      .000  .000  .004  .000  .000  .000  .000  .000  .000  .002  .000  .000  .002  .002

I-10     .10      .034  .030  .038  .036  .040  .050  .044  .014  .014  .010  .020  .024  .022  .020
         .05      .008  .010  .018  .014  .010  .024  .022  .006  .008  .002  .006  .004  .008  .010
         .01      .000  .000  .000  .002  .002  .000  .000  .000  .000  .000  .002  .002  .002  .002

I-11     .10      .068  .068  .068  .074  .066  .062  .062  .090  .100  .074  .082  .084  .094  .100
         .05      .030  .030  .032  .030  .028  .028  .032  .052  .044  .034  .038  .048  .052  .046
         .01      .006  .004  .010  .002  .002  .006  .006  .014  .008  .002  .006  .012  .010  .006

TABLE V

MONTE CARLO SIMULATION OF SIGNIFICANCE LEVEL FOR INCOMPLETE DATA
SAMPLE SIZE N = 20

                                         Dispersion Structure
Test   Nominal α   4A    4B    4C    4D    4E    4F    4G    8A    8B    8C    8D    8E    8F    8G

I-1      .10      .128  .122  .116  .118  .112  .112  .120  .120  .102  .218  .168  .136  .210  .222
         .05      .062  .064  .062  .066  .062  .068  .080  .062  .046  .152  .112  .078  .154  .170
         .01      .010  .020  .016  .020  .012  .020  .030  .016  .014  .080  .036  .032  .076  .088

I-2      .10      .050  .108  .064  .066  .082  .080  .068  .022  .064  .110  .084  .086  .110  .118
         .05      .006  .042  .022  .026  .046  .034  .030  .002  .026  .054  .024  .036  .062  .036
         .01      .000  .002  .004  .006  .002  .002  .002  .000  .004  .014  .004  .006  .016  .008

I-3      .10      .064  .108  .068  .080  .082  .080  .082  .046  .066  .124  .102  .088  .128  .142
         .05      .024  .042  .028  .032  .046  .038  .046  .022  .026  .074  .042  .038  .070  .076
         .01      .000  .002  .008  .006  .002  .004  .006  .000  .004  .016  .008  .006  .020  .016

I-4      .10      .152  .162  .102  .136  .156  .106  .116  .196  .210  .132  .184  .204  .152  .162
         .05      .096  .094  .054  .078  .098  .064  .072  .110  .118  .078  .120  .138  .074  .092
         .01      .024  .026  .010  .020  .024  .014  .016  .044  .036  .022  .050  .046  .030  .024

I-5      .10      .152  .162  .102  .136  .156  .106  .116  .180  .174  .114  .172  .180  .128  .142
         .05      .096  .092  .054  .074  .096  .064  .068  .092  .100  .062  .106  .120  .066  .078
         .01      .024  .026  .008  .020  .024  .014  .016  .032  .024  .020  .034  .042  .024  .016

I-6      .10      .152  .162  .102  .136  .156  .106  .116  .182  .178  .116  .176  .182  .128  .148
         .05      .096  .092  .054  .078  .098  .064  .072  .094  .100  .062  .108  .120  .066  .078
         .01      .024  .026  .010  .020  .024  .014  .016  .032  .024  .020  .034  .042  .024  .016

I-7      .10      .102  .088  .106  .118  .098  .104  .112  .084  .078  .104  .106  .088  .108  .104
         .05      .048  .048  .050  .062  .054  .052  .056  .034  .036  .056  .050  .036  .050  .048
         .01      .006  .002  .012  .006  .008  .012  .010  .006  .000  .012  .012  .002  .014  .012

I-8      .10      .054  .058  .066  .066  .060  .074  .086  .026  .028  .094  .062  .044  .094  .114
         .05      .022  .024  .032  .036  .028  .038  .060  .012  .010  .064  .028  .022  .066  .076
         .01      .004  .004  .008  .008  .002  .004  .008  .004  .004  .034  .008  .002  .032  .028

I-9      .10      .034  .036  .034  .050  .038  .048  .064  .014  .012  .048  .024  .022  .054  .046
         .05      .008  .010  .014  .018  .010  .016  .018  .008  .008  .022  .008  .002  .020  .024
         .01      .000  .000  .000  .000  .000  .000  .000  .000  .000  .002  .002  .000  .004  .004

I-10     .10      .056  .056  .068  .060  .052  .060  .064  .032  .026  .036  .056  .058  .044  .048
         .05      .030  .028  .026  .034  .030  .028  .032  .012  .012  .024  .018  .022  .028  .026
         .01      .004  .004  .004  .006  .004  .004  .008  .000  .002  .006  .002  .002  .010  .006

I-11     .10      .088  .092  .094  .102  .084  .106  .110  .090  .078  .088  .102  .078  .084  .104
         .05      .042  .036  .052  .062  .042  .056  .048  .038  .036  .048  .044  .034  .058  .060
         .01      .006  .004  .012  .008  .008  .006  .00?  .006  .006  .014  .010  .008  .016  .012

TABLE VI

MONTE CARLO SIMULATION OF POWER FOR INCOMPLETE DATA
SAMPLE SIZE N = 10

                                         Dispersion Structure
Test   Nominal α   4A    4B    4C    4D    4E    4F    4G    8A    8B    8C    8D    8E    8F    8G

I-1      .10      1.00  .886  .988  .922  .508  .764  .922  1.00  .982  .926  .822  .452  .492  .654
         .05      1.00  .782  .978  .856  .398  .706  .872  1.00  .958  .892  .756  .352  .430  .604
         .01      1.00  .510  .922  .650  .188  .456  .716  1.00  .848  .820  .582  .170  .304  .476

I-2      .10      1.00  .828  .968  .818  .418  .648  .846  1.00  .920  .826  .568  .254  .302  .458
         .05      1.00  .652  .904  .600  .252  .448  .674  1.00  .768  .682  .324  .102  .152  .310
         .01      .994  .252  .516  .186  .076  .140  .264  1.00  .304  .282  .052  .012  .026  .062

I-3      .10      1.00  .838  .974  .850  .432  .700  .860  1.00  .952  .846  .700  .304  .368  .536
         .05      1.00  .674  .932  .676  .266  .504  .734  1.00  .858  .760  .498  .152  .244  .426
         .01      .994  .294  .666  .314  .090  .200  .368  1.00  .480  .488  .202  .020  .086  .216

I-4      .10      1.00  .668  .966  .800  .360  .698  .892  1.00  .494  .768  .440  .258  .314  .466
         .05      1.00  .496  .954  .670  .240  .584  .822  1.00  .314  .670  .304  .160  .210  .354
         .01      .998  .198  .838  .412  .098  .328  .626  .976  .114  .478  .122  .054  .104  .188

I-5      .10      1.00  .664  .966  .796  .356  .696  .892  1.00  .438  .746  .404  .232  .280  .438
         .05      1.00  .488  .954  .662  .238  .582  .816  .998  .252  .616  .262  .126  .180  .310
         .01      .998  .186  .830  .394  .090  .326  .612  .964  .082  .404  .082  .038  .078  .144

I-6      .10      1.00  .668  .966  .800  .360  .698  .892  1.00  .440  .748  .410  .236  .280  .442
         .05      1.00  .496  .954  .670  .240  .584  .822  .998  .258  .620  .262  .128  .188  .312
         .01      .998  .198  .838  .412  .098  .328  .626  .968  .082  .412  .086  .040  .082  .156

I-7      .10      .992  .486  .852  .576  .260  .472  .704  1.00  .894  .748  .570  .332  .242  .418
         .05      .982  .336  .742  .416  .152  .336  .522  1.00  .796  .590  .420  .218  .160  .258
         .01      .876  .124  .586  .170  .050  .122  .236  .998  .498  .280  .168  .056  .044  .092

I-8      .10      .992  .436  .908  .584  .198  .562  .784  1.00  .834  .816  .644  .256  .324  .522
         .05      .988  .332  .848  .462  .132  .422  .690  1.00  .752  .762  .528  .160  .246  .440
         .01      .948  .144  .690  .266  .064  .228  .454  1.00  .550  .628  .336  .074  .146  .314

I-9      .10      .984  .352  .836  .468  .150  .422  .660  1.00  .722  .658  .440  .136  .168  .338
         .05      .956  .196  .728  .308  .082  .266  .484  1.00  .548  .500  .294  .066  .086  .182
         .01      .790  .058  .372  .108  .012  .098  .204  .990  .200  .178  .096  .004  .014  .048

I-10     .10      .970  .260  .830  .380  .106  .404  .680  .976  .112  .564  .116  .038  .124  .204
         .05      .934  .140  .734  .274  .064  .284  .532  .960  .056  .460  .050  .012  .078  .144
         .01      .786  .040  .478  .116  .016  .136  .286  .834  .010  .216  .010  .002  .014  .056

I-11     .10      .974  .354  .744  .446  .196  .370  .538  .764  .296  .192  .146  .128  .104  .126
         .05      .922  .226  .576  .288  .108  .214  .364  .524  .162  .096  .064  .072  .052  .068
         .01      .682  .080  .248  .090  .020  .070  .146  .162  .032  .010  .018  .014  .014  .018


TABLE VII

MONTE CARLO SIMULATION OF POWER FOR INCOMPLETE DATA
SAMPLE SIZE N = 20

                                         Dispersion Structure
Test   Nominal α   4A    4B    4C    4D    4E    4F    4G    8A    8B    8C    8D    8E    8F    8G

I-1      .10      1.00  .998  1.00  1.00  .820  .970  .998  1.00  1.00  1.00  .984  .780  .770  .922
         .05      1.00  .996  1.00  .996  .736  .936  .994  1.00  1.00  1.00  .978  .676  .732  .898
         .01      1.00  .972  1.00  .970  .498  .858  .978  1.00  1.00  .998  .926  .444  .568  .834

I-2      .10      1.00  .996  1.00  .996  .796  .940  .996  1.00  1.00  1.00  .958  .678  .660  .866
         .05      1.00  .994  1.00  .972  .678  .862  .974  1.00  1.00  .986  .894  .530  .480  .752
         .01      1.00  .946  .994  .854  .374  .624  .850  1.00  .998  .898  .624  .244  .240  .432

I-3      .10      1.00  .996  1.00  .996  .796  .944  .996  1.00  1.00  1.00  .966  .682  .676  .876
         .05      1.00  .994  1.00  .984  .678  .876  .980  1.00  1.00  .988  .918  .536  .510  .786
         .01      1.00  .948  .994  .896  .374  .680  .898  1.00  .998  .928  .738  .248  .276  .552

I-4      .10      1.00  .964  1.00  .984  .598  .930  .998  1.00  .718  .980  .594  .296  .526  .736
         .05      1.00  .918  1.00  .952  .458  .888  .982  1.00  .580  .958  .470  .206  .404  .634
         .01      1.00  .694  1.00  .842  .208  .748  .950  1.00  .292  .882  .256  .098  .226  .450

I-5      .10      1.00  .964  1.00  .984  .598  .930  .998  1.00  .686  .976  .554  .270  .512  .702
         .05      1.00  .918  1.00  .952  .452  .886  .982  1.00  .536  .956  .424  .180  .358  .604
         .01      1.00  .686  1.00  .842  .206  .742  .950  1.00  .248  .850  .192  .070  .172  .398

I-6      .10      1.00  .964  1.00  .984  .598  .930  .998  1.00  .692  .978  .556  .274  .512  .708
         .05      1.00  .918  1.00  .952  .454  .888  .982  1.00  .???  .958  .428  .186  .360  .604
         .01      1.00  .692  1.00  .842  .208  .746  .950  1.00  .250  .852  .192  .074  .172  .402

I-7      .10      1.00  .812  .998  .884  .470  .826  .968  1.00  .998  .982  .916  .702  .522  .764
         .05      1.00  .700  .994  .804  .332  .738  .930  1.00  .992  .960  .846  .564  .360  .648
         .01      1.00  .448  .940  .590  .152  .466  .770  1.00  .972  .858  .638  .290  .150  .356

I-8      .10      1.00  .772  1.00  .884  .430  .842  .972  1.00  .998  .992  .928  .532  .630  .850
         .05      1.00  .678  .994  .818  .306  .780  .960  1.00  .988  .984  .898  .434  .558  .808
         .01      1.00  .486  .982  .646  .150  .612  .872  1.00  .966  .972  .788  .258  .422  .678

I-9      .10      1.00  .728  .996  .848  .360  .798  .954  1.00  .984  .976  .880  .440  .490  .738
         .05      1.00  .586  .986  .748  .240  .678  .906  1.00  .972  .938  .794  .318  .304  .580
         .01      1.00  .332  .918  .464  .082  .392  .710  1.00  .870  .764  .556  .100  .102  .292

I-10     .10      1.00  .574  .996  .750  .258  .762  .948  1.00  .424  .936  .358  .098  .334  .566
         .05      1.00  .452  .986  .630  .156  .662  .902  1.00  .296  .908  .242  .068  .204  .448
         .01      .998  .218  .928  .374  .042  .418  .778  1.00  .092  .786  .094  .010  .086  .258

I-11     .10      1.00  .712  .990  .834  .384  .760  .938  1.00  .934  .848  .658  .392  .272  .468
         .05      1.00  .596  .972  .722  .266  .640  .874  1.00  .844  .710  .512  .258  .156  .314
         .01      .998  .332  .884  .446  .086  .356  .646  1.00  .640  .386  .230  .062  .034  .110

TABLE VIII

MONTE CARLO SIMULATION OF POWER FOR COMPLETE DATA

                                              Dispersion Structure
Test   N    Nominal α   4A    4B    4C    4D    4E    4F    4G    8A    8B    8C    8D    8E    8F    8G

C-1    10     .10      1.00  .978  .996  .982  .674  .882  .980  1.00  .998  .974  .898  .564  .564  .746
       10     .05      1.00  .944  .992  .964  .538  .818  .962  1.00  .994  .954  .858  .446  .502  .710
       10     .01      1.00  .830  .984  .874  .292  .640  .884  1.00  .974  .898  .726  .232  .370  .582

C-2    10     .10      1.00  .966  .992  .974  .614  .826  .960  1.00  .992  .918  .848  .422  .428  .632
       10     .05      1.00  .916  .986  .924  .440  .706  .908  1.00  .982  .852  .726  .264  .328  .500
       10     .01      1.00  .698  .924  .740  .198  .426  .682  1.00  .840  .626  .452  .707  .150  .296

C-3    10     .10      1.00  .706  .996  .846  .314  .768  .958  1.00  .278  .792  .232  .128  .240  .426
       10     .05      1.00  .522  .988  .722  .218  .668  .910  .996  .168  .696  .142  .088  .164  .284
       10     .01      1.00  .258  .952  .428  .094  .402  .738  .958  .066  .462  .066  .044  .056  .132

C-4    10     .10      1.00  .704  .996  .846  .314  .768  .958  1.00  .238  .772  .212  .112  .214  .390
       10     .05      1.00  .516  .986  .720  .212  .660  .906  .996  .152  .656  .118  .078  .146  .234
       10     .01      1.00  .248  .946  .416  .092  .396  .732  .934  .042  .380  .048  .030  .042  .098

C-5    10     .10      1.00  .910  .972  .838  .520  .580  .790  .966  .4??  .258  .220  .166  .128  .166
       10     .05      1.00  .782  .888  .722  .338  .400  .656  .846  .226  .156  .108  .080  .072  .098
       10     .01      1.00  .392  .566  .348  .110  .142  .300  .312  .062  .026  .028  .012  .020  .022

C-6    10     .10      1.00  .974  .994  .926  .616  .672  .896  1.00  .994  .848  .776  .530  .258  .444
       10     .05      1.00  .906  .970  .838  .462  .546  .782  1.00  .984  .720  .606  .344  .148  .296
       10     .01      1.00  .616  .758  .524  .180  .212  .446  1.00  .860  .352  .276  .124  .034  .098

C-1    20     .10      1.00  1.00  1.00  1.00  .914  .984  1.00  1.00  1.00  1.00  .996  .866  .818  .946
       20     .05      1.00  1.00  1.00  1.00  .872  .978  1.00  1.00  1.00  .998  .984  .806  .768  .934
       20     .01      1.00  .996  1.00  .998  .694  .930  .998  1.00  1.00  .998  .968  .594  .656  .890

C-2    20     .10      1.00  1.00  1.00  1.00  .902  .980  1.00  1.00  1.00  .998  .986  .834  .732  .920
       20     .05      1.00  .998  1.00  1.00  .834  .946  1.00  1.00  1.00  .998  .970  .706  .626  .858
       20     .01      1.00  .992  1.00  .994  .612  .866  .966  1.00  1.00  .976  .912  .410  .416  .692

C-3    20     .10      1.00  .980  1.00  .990  .562  .960  1.00  1.00  .536  .992  .454  .202  .474  .752
       20     .05      1.00  .926  1.00  .974  .414  .940  .998  1.00  .392  .982  .326  .138  .356  .634
       20     .01      1.00  .724  1.00  .880  .176  .824  .982  1.00  .180  .916  .156  .050  .172  .418

C-4    20     .10      1.00  .976  1.00  .990  .558  .960  1.00  1.00  .492  .990  .428  .192  .448  .734
       20     .05      1.00  .926  1.00  .968  .414  .940  .998  1.00  .352  .974  .292  .118  .324  .600
       20     .01      1.00  .722  1.00  .880  .172  .822  .982  1.00  .132  .896  .122  .038  .134  .364

C-5    20     .10      1.00  .998  1.00  1.00  .892  .918  .996  1.00  1.00  .936  .870  .608  .320  .542
       20     .05      1.00  .998  .998  .992  .790  .856  .976  1.00  1.00  .852  .744  .406  .172  .362
       20     .01      1.00  .974  .992  .926  .538  .618  .896  1.00  .972  .576  .384  .196  .058  .130

C-6    20     .10      1.00  1.00  1.00  1.00  .936  .956  1.00  1.00  1.00  .998  .986  .892  .564  .818
       20     .05      1.00  .998  1.00  1.00  .878  .906  .994  1.00  1.00  .992  .964  .816  .412  .726
       20     .01      1.00  .990  .994  .966  .642  .740  .952  1.00  1.00  .934  .816  .548  .162  .470

TABLE IX

MONTE CARLO SIMULATION OF SIGNIFICANCE LEVEL FOR COMPLETE DATA

                                              Dispersion Structure
Test   N    Nominal α   4A    4B    4C    4D    4E    4F    4G    8A    8B    8C    8D    8E    8F    8G

C-1    10     .10      .088  .104  .114  .130  .114  .132  .142  .096  .090  .236  .152  .098  .254  .254
       10     .05      .046  .048  .072  .080  .064  .082  .088  .050  .032  .160  .094  .054  .178  .190
       10     .01      .012  .012  .022  .014  .012  .026  .030  .004  .008  .088  .032  .014  .096  .116

C-2    10     .10      .062  .068  .084  .106  .092  .094  .108  .052  .036  .132  .082  .056  .128  .152
       10     .05      .032  .034  .040  .048  .040  .050  .050  .010  .014  .072  .032  .026  .074  .078
       10     .01      .006  .006  .014  .006  .008  .012  .014  .000  .000  .022  .012  .004  .020  .016

C-3    10     .10      .092  .096  .100  .106  .116  .106  .122  .124  .122  .096  .126  .124  .102  .098
       10     .05      .056  .052  .052  .058  .054  .054  .062  .086  .086  .050  .070  .074  .054  .062
       10     .01      .012  .010  .014  .018  .018  .016  .010  .024  .026  .012  .034  .034  .022  .014

C-4    10     .10      .092  .094  .096  .106  .116  .106  .120  .104  .112  .080  .108  .110  .088  .090
       10     .05      .052  .052  .050  .058  .054  .054  .062  .068  .072  .036  .066  .068  .054  .052
       10     .01      .012  .008  .014  .018  .012  .016  .010  .012  .018  .008  .022  .028  .010  .006

C-5    10     .10      .094  .100  .092  .098  .092  .110  .112  .098  .092  .082  .102  .088  .104  .112
       10     .05      .054  .052  .050  .048  .054  .056  .064  .038  .042  .034  .050  .048  .050  .044
       10     .01      .008  .016  .014  .012  .010  .012  .012  .014  .008  .008  .006  .010  .014  .010

C-6    10     .10      .094  .088  .098  .092  .094  .106  .106  .094  .090  .092  .096  .082  .092  .102
       10     .05      .048  .058  .048  .054  .050  .056  .062  .038  .036  .044  .040  .042  .038  .046
       10     .01      .002  .008  .016  .008  .006  .014  .020  .012  .010  .008  .006  .004  .008  .014

C-1    20     .10      .100  .106  .126  .114  .108  .116  .126  .122  .114  .220  .166  .138  .226  .244
       20     .05      .050  .052  .074  .058  .058  .076  .086  .066  .060  .158  .092  .078  .174  .186
       20     .01      .010  .014  .010  .016  .014  .030  .028  .020  .016  .092  .044  .02?  .090  .108

C-2    20     .10      .090  .092  .092  .094  .092  .094  .096  .090  .086  .132  .114  .100  .142  .162
       20     .05      .042  .042  .052  .042  .044  .048  .056  .044  .034  .072  .058  .044  .072  .100
       20     .01      .006  .008  .008  .014  .004  .012  .014  .010  .008  .026  .022  .012  .034  .034

C-3    20     .10      .108  .124  .100  .112  .116  .108  .102  .134  .140  .104  .140  .134  .116  .126
       20     .05      .060  .056  .048  .066  .076  .050  .060  .092  .094  .058  .088  .086  .056  .060
       20     .01      .024  .024  .010  .014  .020  .008  .012  .030  .024  .016  .024  .028  .020  .024

C-4    20     .10      .108  .120  .100  .112  .114  .106  .102  .124  .130  .094  .128  .122  .106  .104
       20     .05      .060  .056  .046  .066  .072  .050  .060  .080  .068  .050  .066  .070  .048  .050
       20     .01      .024  .024  .010  .014  .020  .008  .012  .018  .020  .012  .020  .022  .016  .018

C-5    20     .10      .098  .100  .098  .088  .102  .098  .092  .106  .114  .108  .112  .110  .110  .106
       20     .05      .058  .052  .058  .038  .058  .044  .046  .062  .064  .060  .064  .058  .056  .046
       20     .01      .012  .014  .012  .014  .012  .014  .010  .010  .010  .010  .006  .010  .010  .006

C-6    20     .10      .098  .096  .102  .098  .092  .102  .110  .122  .104  .094  .116  .126  .098  .104
       20     .05      .062  .048  .054  .038  .046  .056  .058  .070  .052  .040  .052  .066  .046  .050
       20     .01      .010  .012  .008  .014  .014  .010  .008  .012  .022  .016  .020  .020  .012  .010
The utility of the EM algorithm in fitting mixed ANOVA models is discussed. Issues addressed
range from practical programming considerations to the suitability of the EM technique for the
inclusion of empirical or investigator Bayesian prior information into the estimates of fixed
effects and variance components. The class of ANOVA models considered is appropriate in many
longitudinal problems including growth curve and repeated measures analysis with arbitrary
patterns of missing data. An example of growth curve modeling is used as an illustration of the
estimation techniques and model specification issues, and for the purposes of comparing the
approach with a simpler 'two-stage' analysis.


1. INTRODUCTION.

This paper discusses the use of the EM
algorithm for fitting a subclass of mixed
(fixed and random effects) linear models to
longitudinal data. The class of models
considered includes growth curves as important
special cases. We illustrate growth curve
modeling with an example taken from an energy
conservation study, which serves to illustrate
the general principles of the longitudinal mixed
model approach. Also discussed is the rate of
convergence of the EM algorithm in this variance
component setting. Simple approaches to
speeding the convergence of the algorithm are
described and illustrated.


2. THE CLASS OF LONGITUDINAL MODELS.

The class of models considered here, which we
term 'longitudinal random effects' models
(Laird and Ware 1982), may be written, as a
representation for $n_i$ different responses for
the $i$th subject, in the form

$$ y_i = X_i \alpha + Z_i \beta_i + \epsilon_i \qquad (i = 1, \ldots, m) \tag{1} $$

Here $X_i$ and $Z_i$ are known design matrices (of
order $n_i \times p$ and $n_i \times q$ respectively), $\alpha$ is a
$p \times 1$ unknown vector of fixed effects, and $\beta_i$ is
the $q \times 1$ vector of random effects for the $i$th
subject, which we assume to be multivariate normally
distributed as $N(0, D)$ independently of $\epsilon_i$ and
of $\beta_j$ for $i \ne j$. The 'intra-subject' error term
$\epsilon_i$ is assumed to be normal, $N(0, \sigma^2 I)$. The
parameters of the model which are to be estimated are then
the vector of fixed effects, $\alpha$, and the variance
components, namely $\sigma^2$ and the $q(q+1)/2$ distinct
elements of $D$. In addition one often considers
the estimation of the random effects, $\beta_i$,
themselves for the purposes of residual analysis
and assessing the influence of outliers. The
LRE class of models is characterized by the
nesting of the random effects within subject.


The model imposes a specific form on the
covariance structure of the distribution of the
$y_i$. That is, the model for the independent $y_i$
vectors is multivariate normal with means $X_i \alpha$
and covariance matrix $\Sigma_i = \sigma^2 I + Z_i D Z_i'$.
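The marginal covariance structure just described can be made concrete with a short numerical sketch. This is only an illustration under stated assumptions: the function name `marginal_covariance` and the small demonstration matrices are hypothetical and do not come from the study data.

```python
import numpy as np


def marginal_covariance(Z, D, sigma2):
    """Return sigma2 * I + Z D Z' for a single subject.

    Z      : (n_i, q) random-effects design matrix
    D      : (q, q) covariance matrix of the random effects
    sigma2 : intra-subject error variance (scalar)
    """
    Z = np.asarray(Z, dtype=float)
    D = np.asarray(D, dtype=float)
    n = Z.shape[0]
    return sigma2 * np.eye(n) + Z @ D @ Z.T


# Hypothetical subject: 3 readings, random intercept and slope (q = 2).
Z_demo = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])
D_demo = np.array([[1.0, 0.2],
                   [0.2, 0.5]])
Sigma_demo = marginal_covariance(Z_demo, D_demo, sigma2=0.25)
```

Because each subject has its own $Z_i$, subjects with different numbers or timings of readings simply produce covariance matrices of different sizes, which is how the model accommodates arbitrary patterns of missing data.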

Growth curves can be considered as a
special subclass of these models characterized
by a linear relationship between the columns of
the $X_i$ and $Z_i$ matrices which we may write as

$$ X_i = Z_i A_i, $$

with $A_i$ a known matrix.


3. AN EXAMPLE OF GROWTH CURVE DATA.


3.1 The Princeton 'Modular Retrofit Experiment'.

The data used as an illustration here are
from an experiment in energy conservation
conducted by Princeton University's Center for
Energy and Environmental Studies (Dutt et al.
1982). In the late 1970's the Center organized a
study which sought to measure the impact of two
levels of conservation activities on energy
utilization in preexisting single family New
Jersey housing (Dutt et al. 1982). The levels
of so-called energy 'retrofit' activity were:

1. 'House-doctor'

2. 'Major-retrofit'


The house-doctor level involved a single
day visit by personnel trained in making
relatively inexpensive repairs to ventilation,
heating, and insulation systems. The
major-retrofit level included the house-doctor
treatment and the addition of attic and wall
insulation. To test the efficacy of these two
retrofit regimens a total of 138 New Jersey
houses heated with natural gas were enrolled in
the study known as the 'Modular Retrofit
Experiment' (MRE) and were randomly assigned to
one of the treatment groups: control, where no
actions were performed by the study,
house-doctor, and major-retrofit. With the
cooperation of participating gas utilities,
utility billing data (usually collected on a
one-month billing cycle) were obtained for one
year prior to the retrofit and house-doctor
activity (pre-intervention). Post-intervention
data were obtained by collecting meter reading
data for an additional year following the
retrofit period. In the subsequent paragraphs
we consider two approaches to these data: a
two-stage model and a unified longitudinal
random effects model.


3.2 Two-stage analysis of the MRE data.

A two-stage analysis of the MRE data can be
performed in the following fashion. Let $y_{ijk}$,
$j = 0, 1$, $k = 1, \ldots, n_{ij}$, $i = 1, \ldots, 138$, be the average
daily natural gas consumption by the $i$th house
for the $k$th meter reading in the $j$th period ($j = 0$
for pre-period and $j = 1$ for post). For each
house in each period (pre and post) models of
the following form were fit, using least
squares:

$$ y_{ijk} = a_{ij} + b_{ij}\,\mathrm{HDD}_{ijk} + \epsilon_{ijk} \tag{2} $$

Here $a_{ij}$ is the heating-insensitive or
'base-level' consumption for house $i$ in period $j$,
$b_{ij}$ is the weather sensitive 'heating-slope' and
$\mathrm{HDD}_{ijk}$ is the average daily heating degree-days
observed for meter reading period $k$ for house $i$
in period $j$. Once $a_{ij}$ and $b_{ij}$ were obtained by
least squares the effects of the levels of
retrofit activity on the intercept and heating
slope over the experiment as a whole were
assessed by calculating the differences

$$ \Delta a_i = a_{i1} - a_{i0} \quad \text{and} \quad \Delta b_i = b_{i1} - b_{i0} $$

and fitting two separate univariate ANOVA models
to these data of form

$$ \Delta a_i = \mu_0 + \mu_1 H_i + \mu_2 R_i + \epsilon_i \quad \text{and} \quad \Delta b_i = \tau_0 + \tau_1 H_i + \tau_2 R_i + \epsilon_i \tag{3} $$

where $H_i$ and $R_i$ are dummy variables indicating
membership in the house-doctor and major
retrofit groups respectively.


3.3 The longitudinal random effects approach to
the MRE data.

An alternative to the two-stage analysis is
the application of a growth curve analysis which
fits a single overall model for all the
consumption data in the experiment. The way of
writing such an LRE model which seems most
analogous to the two-stage model is as

$$ y_i = X_i \alpha + Z_i \beta_i + \epsilon_i \tag{4} $$

where $Z_i$ is of form

$$ Z_i = \begin{bmatrix} 1_{i,0} & 0_{i,0} & \mathrm{HDD}_{i,0} & 0_{i,0} \\ 1_{i,1} & 1_{i,1} & \mathrm{HDD}_{i,1} & \mathrm{HDD}_{i,1} \end{bmatrix} $$

and $X_i$ equals $Z_i A_i$ with $A_i$ equal to

$$ A_i = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & H_i & R_i & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & H_i & R_i \end{bmatrix} $$

Here $y_i$ consists of all the $n_{i,0} + n_{i,1}$
consumption readings for house $i$, $1_{i,j}$ is an
$n_{ij} \times 1$ vector of 1s, $0_{i,j}$ is an $n_{ij} \times 1$ vector of
0s and $\mathrm{HDD}_{i,j}$ is an $n_{ij} \times 1$ vector of average
daily heating degree days for the meter reading
periods in period $j$ ($j = 0, 1$) for house $i$, and $H_i$
and $R_i$ are dummy variables indicating membership
in the two treatment groups. The random
effects, $\beta_i$, in model (4) are $a_i$, $\Delta a_i$, $b_i$ and
$\Delta b_i$, which are the individual house pre-period
base-levels, change in base-level from pre- to
post-periods, pre-period heating-slope and
change in heating-slope, respectively. The
fixed effects are $a$, an overall mean of
pre-period base-level consumption, and $\mu_0$, $\mu_1$, $\mu_2$, which
are the overall means of changes in the
base-level for the control, house-doctor, and
major-retrofit groups, respectively. The remaining
fixed effect parameters for heating-slope, $b$,
$\tau_0$, $\tau_1$ and $\tau_2$, are analogous to the parameters
$a$, $\mu_0$, $\mu_1$ and $\mu_2$ for base-level.
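As a sketch of how the design matrices of this section fit together, the following builds $Z_i$, $A_i$ and $X_i = Z_i A_i$ for a single house. The column ordering (base-level, change in base-level, heating-slope, change in heating-slope) follows our reading of the description above; the function name `design_matrices` and the example degree-day values are hypothetical.

```python
import numpy as np


def design_matrices(hdd_pre, hdd_post, H, R):
    """Build Z_i, A_i and X_i = Z_i A_i for one house.

    hdd_pre, hdd_post : average daily heating degree-days for the
                        pre- and post-period meter readings
    H, R              : 0/1 dummies for the house-doctor and
                        major-retrofit groups
    """
    hdd_pre = np.asarray(hdd_pre, dtype=float)
    hdd_post = np.asarray(hdd_post, dtype=float)
    n0, n1 = len(hdd_pre), len(hdd_post)
    # Pre-period rows: base-level and heating-slope only.
    Z_pre = np.column_stack([np.ones(n0), np.zeros(n0), hdd_pre, np.zeros(n0)])
    # Post-period rows: the two 'change' effects are switched on as well.
    Z_post = np.column_stack([np.ones(n1), np.ones(n1), hdd_post, hdd_post])
    Z = np.vstack([Z_pre, Z_post])
    A = np.array([[1, 0, 0, 0, 0, 0, 0, 0],
                  [0, 1, H, R, 0, 0, 0, 0],
                  [0, 0, 0, 0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 0, 1, H, R]], dtype=float)
    return Z, A, Z @ A


# Hypothetical house-doctor house: two pre-period reads, one post-period read.
Z_demo, A_demo, X_demo = design_matrices([10.0, 20.0], [15.0], H=1, R=0)
```

A house with missing meter readings simply contributes fewer rows to $Z_i$ and $X_i$; nothing else in the construction changes.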


3.4 Which analysis is preferable, LRE or
two-stage?


The longitudinal random effects estimates
of the treatment responses, $\mu_0$, $\mu_1$ and $\mu_2$,
and $\tau_0$, $\tau_1$ and $\tau_2$, may be thought of as optimally
weighted versions of estimates of the same
parameters in the second stage ANOVAs of model
(3). In particular if every house in the MRE
experiment were to have the same $Z_i$ matrix for
the random effects (that is, the same number of
meter-reads and the same heating degree-days in
each meter-reading period) then the two-stage
and the LRE analyses would be essentially
equivalent. In the MRE study this was not the
case. The number of meter reading periods where
data were observed for each house varied, as did
the timing, meaning that the dates of the
beginning and end of each period and hence the
heating degree-days differed for each house.
While the ideal number of readings was
twenty-four for each house, corresponding to two
one-year intervals, about [illegible] of these data were
missing.

The two-stage analysis makes no allowances
for different variances in the intra-subject
parameter estimates (model 2) either due to
missing data or to differences in heating
degree-days across houses. For example, a house
with missing data in the summer will have a less
reliable estimate of $\Delta a_i$ than one having a full
complement of data. Although the two-stage
analysis does not take this information into
account, the random assignment of houses to the
treatments ensures that such data can be
regarded as ancillary to the experiment, as long
as the causes of missing data are independent of
$y_i$. The size of the test of hypothesis
concerning the effect of the treatment should be
correct for the two-stage analysis. The test
will, however, have less power than for the
mixed model approach, provided the longitudinal
random effects model is correct.

Note that the assumptions underlying the
LRE model (1) are more restrictive than for the
two-stage analysis. In particular the
intra-subject error variances, $\sigma^2$, as we have
specified the model here, are assumed to be the
same for all the houses. For the Modular
Retrofit Experiment this is a dubious
assumption. For example, the R-squares for the
individual house heating degree-day models range
from above 0.98 at the highest down to 0.80 for
the lowest. This tends to imply considerable
differences in the intra-subject error variances
from subject to subject. Therefore an important
research question raised by the application of
the LRE model to this data set is just how
sensitive the size and power of hypothesis tests
based on these models will be to this particular
source of model misspecification.

The two-stage analysis, on the other hand,
is immune to differences in the intra-subject
error variances because of the random assignment
of subjects to treatment. The assumption
required for the two-stage analysis is that the
unconditional distribution of the error terms in
model (3) is homoscedastic and Gaussian. Thus
information collected in the course of fitting
the individual subject models may be regarded as
ancillary to the experiment and may be ignored
without biasing the unconditional size of
hypothesis tests based on the two-stage approach.
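The two-stage procedure compared in this section can be sketched in a few lines. This is a minimal illustration assuming ordinary least squares at both stages; the function names `house_fit` and `second_stage` and the toy numbers are hypothetical and are not the MRE data.

```python
import numpy as np


def house_fit(y, hdd):
    """Stage 1: least-squares fit of y = a + b * HDD for one house and period.

    Returns the pair (a, b) as a length-2 array.
    """
    X = np.column_stack([np.ones(len(hdd)), hdd])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def second_stage(delta, H, R):
    """Stage 2: regress the post-minus-pre differences on group dummies.

    Returns (control mean, house-doctor effect over control,
    major-retrofit effect over control).
    """
    X = np.column_stack([np.ones(len(delta)), H, R])
    coef, *_ = np.linalg.lstsq(X, delta, rcond=None)
    return coef


# Toy check of stage 1: exact line with intercept 1 and slope 2.
ab_demo = house_fit(np.array([3.0, 5.0, 7.0]), np.array([1.0, 2.0, 3.0]))

# Toy check of stage 2: six houses, two per group (hypothetical values).
delta_demo = np.array([-0.1, -0.1, -0.3, -0.3, -0.5, -0.5])
H_demo = np.array([0, 0, 1, 1, 0, 0])
R_demo = np.array([0, 0, 0, 0, 1, 1])
effects_demo = second_stage(delta_demo, H_demo, R_demo)
```

Note that the second stage weights every house equally, whatever the precision of its first-stage estimates; that is the point on which the two analyses differ.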


4. ALGORITHMIC APPROACH.


4.1 The EM algorithm.

Once it is decided that the longitudinal
random effects model is appropriate for these
data we are faced with the problem of estimating
the variance components in the model, namely $\sigma^2$
and the elements of $D$. We follow Laird and Ware
(1982) in employing the EM algorithm, with
certain modifications for speeding convergence,
for the iterative estimation of these
parameters. We choose the EM algorithm for the
following reasons. First, when used for maximum
likelihood estimation the EM is known to always
increase the likelihood at each stage of the
iterations (see Dempster, Laird and Rubin, 1977).
Second, for the LRE model (1) it has a very
simple interpretation and implementation in
terms of the unobservable random effects, $\beta_i$.
Third, forms of prior information such as an
a priori distribution on the fixed effects and
certain types of prior estimates of the variance
components can be included directly into the EM
algorithm.


4.2 Maximum likelihood and restricted maximum
likelihood estimation.


To use the EM algorithm in the mixed model
setting, we assume that the individual subject
random effects are missing data. If the $\beta_i$s were
all known then the likelihood equations for the
variance components, $D$ and $\sigma^2$, would be

$$ D = \sum_{i=1}^{m} \beta_i \beta_i' / m \tag{5} $$

$$ \sigma^2 = \Big( \sum_{i=1}^{m} (y_i - Z_i\beta_i - X_i\hat\alpha)'(y_i - Z_i\beta_i - X_i\hat\alpha) \Big) / N \tag{6} $$

with

$$ \hat\alpha = \Big( \sum_{i=1}^{m} X_i'X_i \Big)^{-1} \sum_{i=1}^{m} X_i'(y_i - Z_i\beta_i) \tag{7} $$

and $N = \sum_{i=1}^{m} n_i$.

Laird and Ware (1982) show that iteratively
replacing the right hand side of equations (5)
and (6) with their expected values, given the
data and the parameter estimates $D = D^{(w-1)}$ and
$\sigma^2 = \sigma^{2(w-1)}$ from the previous iteration, and
then recalculating, is an EM algorithm
as studied in Dempster, Laird and Rubin (1977).
Laird and Ware discuss two different approaches
towards performing the estimation, that is, of
calculating (iteratively) the expected values of
$D$ and $\sigma^2$. The first is to assume that the
fixed effects, $\alpha$, are fixed but unknown
parameters to be estimated, in which case the EM
procedure yields maximum likelihood estimates
(MLE). The second approach they call empirical
Bayes. They impose an improper prior
distribution on $\alpha$ as normal with mean zero and
an (infinite) covariance matrix $V$, defined so
that $V^{-1} = 0$. In this case the EM procedure
gives estimates of $D$ and $\sigma^2$ which are equivalent
to the restricted maximum likelihood estimates
(REML) discussed by Patterson and Thompson
(1971).


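4.3 Computing Formulae.


For maximum likelihood estimation the
iterations become:

$$ \alpha^{(w)} = \Big( \sum_{i=1}^{m} X_i'X_i \Big)^{-1} \sum_{i=1}^{m} X_i'\big(y_i - Z_i E(\beta_i|y)\big) $$

$$ D^{(w)} = (1/m) \sum_{i=1}^{m} \big[ E(\beta_i|y)E(\beta_i|y)' + V(\beta_i|y) \big] $$

$$ \sigma^{2(w)} = (1/N) \sum_{i=1}^{m} \big[ (y_i - X_i\alpha)'(y_i - X_i\alpha) - 2(y_i - X_i\alpha)'Z_i E(\beta_i|y) + E(\beta_i|y)'Z_i'Z_i E(\beta_i|y) + \mathrm{tr}(Z_i'Z_i V(\beta_i|y)) \big] $$

where

$$ E(\beta_i|y) = (Z_i'Z_i + \sigma^2 D^{-1})^{-1} Z_i'(y_i - X_i\alpha) \quad \text{and} \quad V(\beta_i|y) = \sigma^2 (Z_i'Z_i + \sigma^2 D^{-1})^{-1}. \tag{8} $$

For notational convenience the iteration number
$(w-1)$ has been suppressed in the right hand side
of these expressions.

For REML estimation the computing formulae
(from Cook 1982) are a bit more complicated.
They are:

$$ \sigma^{2(w)} = (1/N) \Big[ \sum_{i=1}^{m} y_i'y_i - 2E(\alpha|y)' \sum_{i=1}^{m} X_i'y_i - 2 \sum_{i=1}^{m} E(\beta_i|y)'Z_i'y_i + E(\alpha|y)' \Big( \sum_{i=1}^{m} X_i'X_i \Big) E(\alpha|y) + \mathrm{tr}\Big( V(\alpha|y) \sum_{i=1}^{m} X_i'X_i \Big) + \sum_{i=1}^{m} E(\beta_i|y)'Z_i'Z_i E(\beta_i|y) + \mathrm{tr}\Big( \sum_{i=1}^{m} Z_i'Z_i V(\beta_i|y) \Big) + 2E(\alpha|y)' \sum_{i=1}^{m} X_i'Z_i E(\beta_i|y) + 2 \sum_{i=1}^{m} \mathrm{tr}\big( X_i'Z_i\, \mathrm{COV}(\beta_i, \alpha|y) \big) \Big] \tag{9} $$

$$ D^{(w)} = (1/m) \sum_{i=1}^{m} \big[ E(\beta_i|y)E(\beta_i|y)' + V(\beta_i|y) \big] \tag{10} $$

where

$$ E(\alpha|y) = H^{-1} \Big[ \sum_{i=1}^{m} X_i'y_i - \sum_{i=1}^{m} X_i'Z_i (Z_i'Z_i + \sigma^2 D^{-1})^{-1} Z_i'y_i \Big] $$

$$ V(\alpha|y) = \sigma^2 \Big[ \sum_{i=1}^{m} X_i'X_i - \sum_{i=1}^{m} X_i'Z_i (Z_i'Z_i + \sigma^2 D^{-1})^{-1} Z_i'X_i \Big]^{-1} $$

$$ E(\beta_i|y) = (Z_i'Z_i + \sigma^2 D^{-1})^{-1} \big( Z_i'y_i - Z_i'X_i E(\alpha|y) \big) $$

$$ V(\beta_i|y) = \sigma^2 \big[ (Z_i'Z_i + \sigma^2 D^{-1})^{-1} + G_i H^{-1} G_i' \big] $$

$$ \mathrm{COV}(\beta_i, \alpha|y) = -\sigma^2 (Z_i'Z_i + \sigma^2 D^{-1})^{-1} Z_i'X_i H^{-1} $$

and where

$$ H = X'X - X'Z F^{-1} Z'X, $$

$$ F = \begin{bmatrix} (Z_1'Z_1 + \sigma^2 D^{-1}) & 0 & \cdots & 0 \\ 0 & (Z_2'Z_2 + \sigma^2 D^{-1}) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & (Z_m'Z_m + \sigma^2 D^{-1}) \end{bmatrix} $$

and

$$ G_i = (Z_i'Z_i + \sigma^2 D^{-1})^{-1} Z_i'X_i. $$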
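A single maximum likelihood EM iteration of the kind described by Eq (8) might be sketched as follows. This is our own illustrative reading rather than the authors' program: every quantity on the right hand side uses the values from the previous iteration, and the name `em_ml_step` and its argument layout (lists of per-subject arrays) are assumptions.

```python
import numpy as np


def em_ml_step(ys, Xs, Zs, alpha, D, sigma2):
    """One EM iteration for ML estimation in the model y_i = X_i a + Z_i b_i + e_i.

    ys, Xs, Zs : lists of per-subject response vectors and design matrices
    alpha, D, sigma2 : estimates from the previous iteration
    Returns the updated (alpha, D, sigma2).
    """
    m = len(ys)
    N = sum(len(y) for y in ys)
    q = D.shape[0]
    D_inv = np.linalg.inv(D)
    XtX = sum(X.T @ X for X in Xs)
    rhs = np.zeros(XtX.shape[0])
    D_sum = np.zeros((q, q))
    ss = 0.0
    for y, X, Z in zip(ys, Xs, Zs):
        W = np.linalg.inv(Z.T @ Z + sigma2 * D_inv)
        r = y - X @ alpha                 # residual from the fixed effects
        b = W @ Z.T @ r                   # E(beta_i | y)
        V = sigma2 * W                    # V(beta_i | y)
        D_sum += np.outer(b, b) + V
        e = r - Z @ b
        # e'e expands to r'r - 2 r'Z b + b'Z'Z b, the first three terms
        # of the sigma^2 update; the trace term adds the posterior variance.
        ss += e @ e + np.trace(Z.T @ Z @ V)
        rhs += X.T @ (y - Z @ b)
    alpha_new = np.linalg.solve(XtX, rhs)
    return alpha_new, D_sum / m, ss / N
```

In practice the step would be called repeatedly until successive estimates agree to the desired precision. Because $D^{(w)}$ is a sum of outer products and positive definite matrices, the update cannot leave the parameter space, a property returned to in Section 4.6.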


4.4 The EM algorithm's speed of convergence.

One common criticism of the use of the EM
algorithm in many settings, not just variance
component estimation, is that it can be
extremely slow to converge, often even when
other methods such as Newton-Raphson or Fisher's
scoring converge rapidly. The reason that this
can be the case is that the EM algorithm is a
first order successive substitution method,
and thus will exhibit linear convergence at the
end of the iterations. To see this let $\theta$ be the
vector of parameters to be estimated by EM. For
the longitudinal random effects model $\theta$ consists
of $\sigma^2$ and the $q(q+1)/2$ distinct components of $D$.
The EM algorithm at the $w$th iteration consists
of the successive substitution step

$$ \theta^{(w)} = g(\theta^{(w-1)}) \tag{11} $$

where $g$ represents the entire EM step, i.e. the
updating formulae (8) or (9) and (10). Using
the first term of a Taylor series expansion of $g$
we can write

$$ \theta^{(w+1)} - \theta^{(w)} \approx J^{(w-1)} (\theta^{(w)} - \theta^{(w-1)}) $$

where $J^{(w-1)}$ is the matrix of partial derivatives
$\partial g(\theta) / \partial \theta$ evaluated at $\theta^{(w-1)}$. Assuming suitable
differentiability conditions hold, as $\theta^{(w)}$
approaches $\theta^{\infty}$, that is, as the EM algorithm
converges, $J^{(w)}$ will converge to $J^{\infty}$, and for $w$
large enough we will have

$$ \theta^{(w+1)} - \theta^{(w)} \approx J^{\infty} (\theta^{(w)} - \theta^{(w-1)}) $$

to any desired degree of precision. Further
iterations produce differences in the parameter
estimates iteratively as

$$ \theta^{(k+w+1)} - \theta^{(k+w)} \approx (J^{\infty})^{k+1} (\theta^{(w)} - \theta^{(w-1)}). \tag{12} $$

But this implies (see Gerald 1970 p 182, for
example) that the left hand side of the
preceding equation will approach an eigenvector
associated with the largest eigenvalue, $\lambda$, of $J^{\infty}$
(so long as $\lambda$ is distinct). We see, therefore,
that the limiting rate of convergence of the EM
will be determined by the size of $\lambda$, which can
be shown to be real and between zero and one
(see Dempster, Laird and Rubin, 1977). If $\lambda$ is
near one then the EM algorithm will be extremely
slow in converging since the step sizes will be
small. On the other hand if $\lambda$ is near zero the
algorithm will be rapid (though still linearly
convergent) in the final stages.
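The limiting behaviour described above can be seen numerically on a toy successive substitution scheme. The sketch below uses a linear map $g(\theta) = J\theta + c$ with a hypothetical $J$ whose largest eigenvalue is 0.9; it stands in for an EM step and is not one. The names `successive_ratios`, `J_demo` and `c_demo` are ours.

```python
import numpy as np


def successive_ratios(g, theta0, n_iter):
    """Iterate theta <- g(theta) and return the ratios of the norms of
    successive differences, which approach the largest eigenvalue of the
    Jacobian at the fixed point under linear convergence."""
    thetas = [np.asarray(theta0, dtype=float)]
    for _ in range(n_iter):
        thetas.append(g(thetas[-1]))
    diffs = [np.linalg.norm(b - a) for a, b in zip(thetas[:-1], thetas[1:])]
    return [d1 / d0 for d0, d1 in zip(diffs[:-1], diffs[1:])]


# Hypothetical linear map with eigenvalues 0.9 and 0.3.
J_demo = np.array([[0.9, 0.0],
                   [0.0, 0.3]])
c_demo = np.array([1.0, 1.0])
ratios_demo = successive_ratios(lambda t: J_demo @ t + c_demo, np.zeros(2), 40)
```

The early ratios reflect a mixture of both eigenvalues, while the late ratios settle at the dominant one, so the final approach to the limit gains only about one decimal digit every twenty or so iterations when $\lambda = 0.9$.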

4.5 Speeding up the EM algorithm.

The methods we discuss here are applicable
for accelerating the convergence of any linearly
convergent successive substitution algorithm.
They can be considered to be multivariate forms
of the Aitken acceleration method (Gerald 1970).
The basic idea is to employ either an estimate
of $J^{\infty}$ or of $\lambda$ to change the convergence behavior
of the EM algorithm from linear to quadratic.

It is useful to monitor the convergence of
the EM algorithm by estimating $\lambda$ in the course
of the iterations. One reasonable estimate of $\lambda$
might be

$$ \hat\lambda = (1/s) \sum_{i=1}^{s} (\theta_i^{(w)} - \theta_i^{(w-1)}) / (\theta_i^{(w-1)} - \theta_i^{(w-2)}) \tag{13} $$

where $s$ is the number of components of $\theta$. This
is the mean of the ratios of the differences of
the individual parameter estimates obtained in
the most recent two iterations. From equation
(12) it is clear that as $w$ approaches $\infty$ this
will converge to $\lambda$. If all of the parameter
changes are approximately proportional, that is,
if

$$ (\theta_i^{(w)} - \theta_i^{(w-1)}) \approx \lambda (\theta_i^{(w-1)} - \theta_i^{(w-2)}) \quad \text{for } i = 1, \ldots, s, $$

and if $\lambda$ is between zero and 1, then it is
appropriate to use $\hat\lambda$ to speed convergence. From
equation (12) we can write

$$ \theta^{\infty} - \theta^{(w-1)} \approx \sum_{k=0}^{\infty} \lambda^{k} (\theta^{(w)} - \theta^{(w-1)}) = 1/(1-\lambda)\, (\theta^{(w)} - \theta^{(w-1)}). $$

Thus we can estimate

$$ \theta^{\infty} \approx \theta^{(w-1)} + 1/(1-\hat\lambda)\, (\theta^{(w)} - \theta^{(w-1)}). \tag{14} $$

This estimate, $\theta^{\infty}$, could then be used instead of
$\theta^{(w)}$ in further iterations. Of course it would
be advisable to check that $\theta^{\infty}$ actually increases
the likelihood over $\theta^{(w)}$ just to be sure. This
is essentially the same thing as applying a
univariate Aitken's acceleration to each of the
parameters being estimated.
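The estimate of the convergence rate and the resulting extrapolation might be coded as below. This is a sketch under the assumption that the three most recent iterates are available as NumPy arrays; the function names `lambda_estimate` and `lambda_extrapolate` and the toy iteration are hypothetical.

```python
import numpy as np


def lambda_estimate(th_w, th_w1, th_w2):
    """Mean of the componentwise ratios of successive differences, Eq (13).

    th_w, th_w1, th_w2 : iterates at steps w, w-1 and w-2.
    Returns (mean ratio, array of individual ratios); the spread of the
    individual ratios indicates whether the extrapolation is appropriate.
    """
    ratios = (th_w - th_w1) / (th_w1 - th_w2)
    return ratios.mean(), ratios


def lambda_extrapolate(th_w, th_w1, lam):
    """Extrapolated limit theta^(w-1) + (theta^(w) - theta^(w-1)) / (1 - lam), Eq (14)."""
    return th_w1 + (th_w - th_w1) / (1.0 - lam)


# Toy iteration theta <- 0.8 * theta + c, whose fixed point is 5 * c.
c_demo = np.array([1.0, 2.0])
th0 = np.zeros(2)
th1 = 0.8 * th0 + c_demo
th2 = 0.8 * th1 + c_demo
lam_demo, ratios_demo = lambda_estimate(th2, th1, th0)
limit_demo = lambda_extrapolate(th2, th1, lam_demo)
```

In this toy case all components contract at exactly the same rate, so a single extrapolation lands on the fixed point. With real EM iterates the individual ratios should be inspected first, and the extrapolated value accepted only if it increases the likelihood.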

Figure 1 shows plots of several of the
variance component estimates, against iteration
number, calculated for the MRE data. For
illustrative purposes in this plot, extremely
poor initial values for $D$ and $\sigma^2$ were purposely
used. After the iterations had been run
six times we calculated $\hat\lambda$ and the RMSE of the
summands of (13) as equal to 0.2204 and 0.0285
respectively. Since $\hat\lambda$ was relatively close to
zero with a small heterogeneity over the $s$
components of $\theta$, we expect at this point in the
iterations that the EM will converge readily, as
seen in Figure 1. At this point in the
iterations we can apply equation (14) with $w = 6$
although it probably is unnecessary to do so
since $\hat\lambda$ is so small.

Another approach towards speeding up the
algorithm is to estimate $J^{\infty}$ rather than $\lambda$ and
use a multivariate generalization of the Aitken
acceleration procedure. Since $J^{(w)}$ generally
must approach $J^{\infty}$ before $\theta^{(w)} - \theta^{(w-1)}$ converges
to an eigenvector of $J^{\infty}$ we see that $J^{\infty}$ can often
be estimated earlier than $\lambda$.

Figure 1

Plots of Several Variance Component Estimates Against Iteration Number
for the MRE Data.

This method essentially amounts to employing a
Newton step to help solve the likelihood
equations as written in expression (11). If we
write

$$ \theta^{\infty} - \theta^{(w-1)} = \lim_{l \to \infty} \sum_{k=0}^{l} (\theta^{(w+k)} - \theta^{(w+k-1)}) $$

and if we approximate

$$ \theta^{(w+k)} - \theta^{(w+k-1)} \approx (J^{\infty})^{k} (\theta^{(w)} - \theta^{(w-1)}), $$

we have as $l$ approaches $\infty$

$$ \theta^{\infty} - \theta^{(w-1)} \approx \sum_{k=0}^{\infty} (J^{\infty})^{k} (\theta^{(w)} - \theta^{(w-1)}). $$

Since, from Dempster, Laird and Rubin (1977), $J^{\infty}$ has
all its eigenvalues between zero and one the
power series converges and is equal to $(I - J^{\infty})^{-1}$.

Thus by approximating $J^{\infty}$ we can try
speeding up the algorithm, estimating

$$ \theta^{\infty} \approx \theta^{(w-1)} + (I - J^{\infty})^{-1} (\theta^{(w)} - \theta^{(w-1)}). \tag{15} $$

Then (after checking that $\theta^{\infty}$ indeed increases
the likelihood over $\theta^{(w)}$) we can substitute $\theta^{\infty}$
for $\theta^{(w)}$ in further iterations.

The question which remains is: How does one
estimate $J^{\infty}$? One could of course estimate $J^{\infty}$ by
$J^{(w)}$. For maximum likelihood estimation, it is
not too hard to give explicit formulae for $J^{(w)}$,
either by directly differentiating the updating
formulae presented in Eq (8) of Section 4.3, or
by using methods discussed by Louis (1982).
These calculations, however, would seem to get
unbearably messy for REML estimation. It is,
nevertheless, not generally necessary to know
the form of $J^{(w)}$ to attempt the speedup. We
can instead approximate $J^{\infty}$ from the past history
of the iterations themselves. Thus for
$w > s$ we can approximate $J^{\infty}$ as

$$ \hat J = \delta\theta\, (\delta\theta_{-1})^{-1} \tag{16} $$

where $\delta\theta$ is an $s \times s$ matrix of form

$$ [\, \theta^{(w)} - \theta^{(w-1)} \mid \theta^{(w-1)} - \theta^{(w-2)} \mid \ldots \mid \theta^{(w-s+1)} - \theta^{(w-s)} \,] $$

and $\delta\theta_{-1}$ is the corresponding matrix with every
column lagged by one iteration. As $w$ approaches $\infty$
this procedure becomes numerically unstable because

$$ (\theta^{(w-k)} - \theta^{(w-k-1)}) \approx \lambda (\theta^{(w-k-1)} - \theta^{(w-k-2)}) $$

and so the inverse of $\delta\theta_{-1}$ no longer exists. Of
course when this occurs we can simply switch to
the '$\lambda$-method' to accomplish the same thing.
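The multivariate acceleration can be sketched as follows. The code assumes that the matrix inverted in Eq (16) is the matrix of successive differences lagged by one iteration, which is our reading of the text rather than a formula confirmed by it. The function names and the linear toy map are hypothetical.

```python
import numpy as np


def estimate_J(thetas):
    """Estimate the limiting Jacobian from the s + 2 most recent iterates.

    thetas : sequence of s + 2 parameter vectors (each of length s),
             oldest first. Columns are paired so that each difference
             is matched with the difference one iteration earlier.
    """
    T = np.asarray(thetas, dtype=float)
    diffs = np.diff(T, axis=0).T          # s x (s + 1), oldest difference first
    newer = diffs[:, 1:]
    older = diffs[:, :-1]
    return newer @ np.linalg.inv(older)


def aitken_extrapolate(th_w, th_w1, J):
    """Extrapolated limit theta^(w-1) + (I - J)^{-1} (theta^(w) - theta^(w-1)), Eq (15)."""
    s = len(th_w)
    return th_w1 + np.linalg.solve(np.eye(s) - J, th_w - th_w1)


# Toy linear iteration theta <- J theta + c with fixed point (12, 2).
J_true = np.array([[0.9, 0.1],
                   [0.0, 0.5]])
c_demo = np.ones(2)
thetas_demo = [np.zeros(2)]
for _ in range(3):
    thetas_demo.append(J_true @ thetas_demo[-1] + c_demo)
J_hat = estimate_J(thetas_demo)
limit_demo = aitken_extrapolate(thetas_demo[-1], thetas_demo[-2], J_hat)
```

For an exactly linear map the Jacobian is recovered exactly and one extrapolation reaches the fixed point. For EM iterates the recovery is only approximate, and a check of the condition number of the lagged difference matrix (for example with `np.linalg.cond`) is a reasonable guard before inverting it.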


While for the MRE data the EM iterations
converged quite readily it is not hard to find
examples of slow convergence. Figure 2 gives
plots of estimates arising from a growth curve
problem where convergence was extremely slow.
We notice that the first few iterations
(starting from fairly poor initial values)
produced large step sizes but in the later
iterations the algorithm was very reluctant in
approaching its final values. Even after more
than one-hundred iterations the variance
component estimates continued to change in the
third decimal place from step to step. After
six iterations of the EM on these data we
estimated $J^{\infty}$ using Eq (16) as

     0.7607   2.7226   1.3997  -4.1169
    -0.0178   1.7790   0.3019  -1.3840
    -0.4552  -6.0342  -2.5391   8.0242
    -0.1122  -0.5491  -0.5458   1.4118

We find that the largest eigenvalue of this
matrix equals 0.899 which corresponds well with
the slow convergence of the estimates observed
in Figure 2. However at iteration 6 the use of
the '$\lambda$-method' seemed inappropriate since the
summands in Eq (13), namely

$$ (\theta_i^{(6)} - \theta_i^{(5)}) / (\theta_i^{(5)} - \theta_i^{(4)}), $$

varied greatly, from 2.75 to 0.09, indicating
that $(\theta^{(6)} - \theta^{(5)})$ was nowhere near an
eigenvector of $J^{\infty}$. Nevertheless good results
for these data were obtained by the use of
the multivariate Aitken's acceleration method
(15) when this procedure was applied at the 6th,
12th, and 18th iterations. The results are
shown in Figure 2 as the line on the plots which
begins at iteration 7.

Our recommendation for exploiting these
extremely simple procedures for accelerating
convergence is to attempt to use Aitken's
acceleration method, Eq (15), first, but, if
the matrix to be inverted in Eq (16)
is too ill-conditioned to invert, to switch
to the $\lambda$-method, Eq (14), where the largest
eigenvalue, $\lambda$, is estimated from Eq (13). In
passing we note that the computational burden of
these techniques is far less than that of
performing an EM step and thus they should always be
considered as a convergence accelerator for the EM, or in
fact in any linearly convergent iterative
algorithm.


Figure 2: Plots of Variance Component Estimates for Growth Curve Example.

Also Shown are the Results of Aitken's Acceleration Procedure.


4.6 Incorporation of 'prior information' on the
variance components.


In Section 4.2 we noted that the EM
algorithm is well suited to estimation when an
improper prior distribution is placed on the
'fixed effects', $\alpha$, for empirical Bayes
estimation, which we note is equivalent to REML
estimation. The EM algorithm is also suited to
the incorporation of certain types of prior
information on the components of $D$.

Suppose that we have a prior estimate $D_p$ of
$D$ and further suppose that we think of $D_p$ as
having resulted from observing $n_p$ independent
'unobservables', $\beta_i$, for $i = -n_p+1, \ldots, 0$. (The
negative index indicates the prior nature of
the knowledge of the $\beta_i$.) Under this,
admittedly artificial, assumption it can easily
be shown that the EM step for maximizing the
combined likelihood of the observed $y_i$ ($i > 0$) and
$\beta_i$ ($i \le 0$) data is to simply let

$$ D^{(w)} = (m \tilde D^{(w)} + n_p D_p) / (m + n_p). $$

Here $\tilde D^{(w)}$ is the usual EM estimate, as given in
Section 4.3, at the $w$th iteration, but
calculated using $D^{(w-1)}$ as the estimate at the
previous iteration.
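The pooling of a prior estimate with the usual EM estimate can be illustrated in a few lines. The weighted-average form used here, with weights $m$ and $n_p$, is our reconstruction from the surviving denominator $m + n_p$ and the 'prior observations' argument, so it should be read as an assumption. The function name `shrink_D` and the example matrices are hypothetical.

```python
import numpy as np


def shrink_D(D_em, D_prior, m, n_prior):
    """Pool the usual EM estimate of D with a prior estimate.

    D_em    : EM estimate based on the m observed subjects
    D_prior : prior estimate, treated as if based on n_prior subjects
    """
    D_em = np.asarray(D_em, dtype=float)
    D_prior = np.asarray(D_prior, dtype=float)
    return (m * D_em + n_prior * D_prior) / (m + n_prior)


# Hypothetical boundary case: a singular EM estimate pulled toward the identity.
D_shrunk_demo = shrink_D(np.ones((2, 2)), np.eye(2), m=10, n_prior=2)
min_eig_demo = np.linalg.eigvalsh(D_shrunk_demo).min()
```

With $n_p = 0$ the usual EM estimate is returned unchanged, and increasing $n_p$ moves the estimate smoothly toward the prior, which gives a single number describing how much prior information was used.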

Of course it would be quite unusual if a
meaningful estimate of $D$ was available before
the start of the experiment, much less that the
estimate had been derived by measuring
unobservables. Nevertheless this procedure may
still have utility in certain cases. Helms
(1985, paper read this session) reports a number
of instances when (using Fisher's scoring to
find ML estimates) the values of $D$ and $\sigma^2$ which
solved the likelihood equations were outside the
parameter space. That is, $\hat D$ had one or more
negative eigenvalues. When using the EM
algorithm in such circumstances the eigenvalues
of $\hat D$ will not actually be permitted to become
negative; the estimate, $\hat D$, will instead head
towards a point on the boundary of the parameter
space as a limit which is never entirely
attained. In this case it would seem entirely
justifiable to pull back $\hat D$ from the boundary in
a specified direction, perhaps towards the
identity matrix. Thinking about this procedure
in terms of the employment of a 'prior' estimate
of $D$ means that we can characterize our final
estimate in terms of the strength of the prior
information used, that is, the size of $n_p$, to
produce the final estimate. This process would
seem to roughly correspond to the ridge
regression approach towards least squares
fitting.

5.  RESULTS  FOR  THE  PRINCETON  DMA.  TWO-STHX 

VS  LONGITUDINAL  RAMXH  EFFECTS. 


estimation,  the  longitudinal  random  effects 
model  discussed  in  Section  3.3  to  the  HRE  data 
are  shown  below. 


a 

1.512 

>‘o 

-0.081 
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-0.152 
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-0.173 
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0.228 
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-0.026 
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-0.013 
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0.1724 

-0.0180 

0.0136 

-0.0010 


0.0106 

-0.0005  0.0047 

0.0007  -0.0010 


0.0006 


The  estimate  of  the  variance  covariance 
matrix  of  the  fixed  effects  is: 

15.28 

-4.12  24.03 

-0.15  -22.24  36.62  X  10"* 

0.09  -22.34  22.30  42.53 
0.77  0.18  0.01  -0.01  0.28 

0.18  -0.95  0.85  0.86  -0.02  0.09 

0.00  0.85  -1.41  -0.86  0.00  -0.09  0.15 

-0.01  0.86  -0.86  -1.63  0.00  -0.09  0.09  0.17 


Table  1  conpares  the  results  obtained  using  the 
longitudinal  random  effects  methods  with  those 
from  the  two-stage  analysis. 

we  note  that  the  IKE  analysis  gives  greater 
statistical  significance  to  the  changes  in 
lasting  slope  and  less  to  the  changes  in  base 
level  than  does  the  two-stage  analysis.  This 
seems  to  reflect  the  fact  that  imlssing  data  wre 
more  common  in  the  summer  months  of  the  study 
than  in  the  winter,  since  in  general  missing  a 
summer  datapoint  las  more  effect  on  the 
variability  of  the  base-level  parameter  than  on 
the  heating  slope.  The  group  witli  the  largest 
proportion  of  missing  data  was  the  house  doctor 
group  in  the  post-period.  It  is  this  group's 
estimate  of  change  in  heating  slope  (over  that 
of  the  controls)  for  whicl\  the  conclusions  of 
the  two  analyses  differ  most  markedly. 

6.  (XNCLUSIONS  AND  REMARKS. 

Che  reason  for  offering  the  Modular 
Retrofit  Experiment  data  as  an  illustrative 
example  for  the  longitudinal  random  effects 
model  is  that  it  raises  several  interesting 
model  specification  issues.  For  example,  tl» 
assumption  that  the  error  variance,  o*,  is  the 
same  for  all  subjects  is  likely  Inappropriate 
for  these  data.  Moreover,  the  form  of  the 
intra-subject  models  used  iiere  is  less  than 


The  results  for  fitting,  using  REML 


TABLE 1

ESTIMATED PARAMETER CHANGES IN TREATMENT GROUPS
OVER CONTROLS FOR THE DATA

IRE ANALYSIS
                  Heating-slope       Base-level
House-doctor      -0.011 (-2.898)     -0.138 (-2.274)
Major-retrofit    -0.033 (-7.968)     -0.161 (-2.474)

TWO-STAGE ANALYSIS
                  Heating-slope       Base-level
House-doctor      -0.008 (-1.21)      -0.176 (-3.42)
Major-retrofit    -0.030 (-4.37)      -0.212 (-3.79)

t-statistics are shown in parentheses.


optimal as well. In modeling the individual
houses here, all heating degree-days were
calculated at the arbitrarily fixed temperature
setting of 60°. A more physically meaningful
model for the individual houses involves the
estimation of the heating degree-day reference
temperature, as in Butt et al (1982), for each
house in each of the pre- and post-periods.

That is, the model should take into account the
possibility of between-subject variation in
thermostat settings or other physical factors
which affect the reference temperature at which
a house's gas furnace turns on as temperature
decreases. Such reference temperature
estimation, however, produces an intra-subject
model which is intrinsically nonlinear in its
parameters. The incorporation of nonlinear
intra-subject models into the IRE setting must
be regarded as an area open for further
research. Until the significance of these
departures in model specification is further
investigated, or until the IRE model is
further extended, the two-stage analysis of
these data would seem to be the most
trustworthy. Nevertheless, the comparisons
between the two-stage results and those for the
IRE model are very intriguing.

One common complaint about the EM
algorithm, when compared to gradient methods
like Fisher's scoring, is that at the end of the
iterations we are left without the usual
information matrix estimate of the variance
covariance matrix of the parameter estimates.

We note here, however, that in the variance
components problem, this criticism is
inappropriate if one is primarily interested in
the estimates of the fixed effect parameters.
When using ML estimation the expected
information estimate of the asymptotic variance
of the fixed effects gives

\[
\mathrm{Asym\ Var}(\alpha) = \Bigl( \sum_{i=1}^{m} X_i' \, (\sigma^2 I + Z_i D Z_i')^{-1} X_i \Bigr)^{-1}
\]

Thus when computing the asymptotic variance
covariance matrix of α we do not include any
information concerning the variability of our
estimates of D or σ². This estimate of the
variance of α, or any linear combination of α,
can be computed once, at the end of the
iterations. While Fisher's scoring, unlike the
EM, automatically gives information about the
variability of D and σ² at the end of the
iterations, it does not give any way to make use
of this information in refining the estimates of
the variance of α, which is the issue most often
of interest. The fact that for ML estimation an
information matrix for the variance components
is available using Fisher's scoring, but not
from the EM algorithm, does not alone seem to be
important enough to govern the choice between
algorithms, at least in most common applications
of the IRE model.
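The one-off variance computation described above can be sketched in a few lines. This is only an illustration, assuming NumPy and assuming that converged estimates of D and σ² are already in hand from either algorithm; the function name and the toy design matrices are hypothetical and do not come from the Modular Retrofit data.

```python
import numpy as np

def fixed_effects_cov(X_list, Z_list, D, sigma2):
    """Inverse of sum_i X_i' (sigma2*I + Z_i D Z_i')^{-1} X_i."""
    p = X_list[0].shape[1]
    info = np.zeros((p, p))
    for X, Z in zip(X_list, Z_list):
        V = sigma2 * np.eye(X.shape[0]) + Z @ D @ Z.T   # marginal covariance, subject i
        info += X.T @ np.linalg.solve(V, X)             # X_i' V_i^{-1} X_i
    return np.linalg.inv(info)

# Toy example (made-up numbers): three subjects with 4, 5 and 6 observations,
# each with a random intercept and a random slope.
X_list = [np.column_stack([np.ones(n), np.arange(n)]) for n in (4, 5, 6)]
Z_list = [X.copy() for X in X_list]
D = np.array([[1.0, 0.2], [0.2, 0.5]])
cov = fixed_effects_cov(X_list, Z_list, D, sigma2=0.8)

c = np.array([0.0, 1.0])        # a linear combination: here, the slope alone
var_slope = c @ cov @ c
```

The variability of the estimates of D and σ² does not enter anywhere, which is the point made in the text.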
The intended purpose of this workshop was to bring to light ideas
relating to the effectiveness of use of statistical software. That is,
the fact that a particular piece of statistical software is capable of
performing a given task is to be considered within the perspective of
the ease and efficiency with which a user can avail himself/herself of
this functionality. The discussions reported herein focussed on
benchmarking and the desire of users to be able to deal with
categorical variables which have an underlying ordering, as well as
some of the mundane but important details of statistical computing.
INTRODUCTION


This workshop was organized in an
attempt to bring together statisticians
and designers of statistical software
so that an exchange of ideas might
result in the future development of
statistical software well-suited to
particular classes of users and
procedures for assessing this
suitability. A previous workshop
entitled "Which tools to use in
statistical analysis? Choices of
hardware and software" [9] was held in
Ottawa on November 8, 1984 as a prelude
to the present discussion. It took as
its perspective the choices open to the
user in attempting to solve a
particular statistical problem. In the
present workshop it was intended that
the emphasis would shift slightly to
give an overview of the process by
which software might be assessed and
selected in order to develop measures
of "performance" of the statistician
with the statistical software.
Recognizing that this goal is
ambitious, it was gratifying to note
the willingness of conference
participants to cooperate in developing
ideas in this area.

The workshop was moderated by J.C. Nash
who wrote notes directly on overhead
slides which were then drafted into
this report with the help of some of
the participants (identified in the
Acknowledgements). The report is
structured as a dialog, though the
editor has taken some liberties in
expanding the original notes to clarify
the ideas. Due to time constraints in
preparing copy for publication, some
references and statements remain
incomplete and are marked as such by
"??".

DISCUSSION 

Tung: Can we focus on the following 2
ideas:

1. the development of benchmark
problems and data sets;

2. the criteria for assessing
how well the software has handled
these?

General  group:  -  agreement  to  this 
suggestion 

-  suggestion  that  linear 
regression  benchmarks  be  the  first 
target  problem. 

Simon: There are two benchmark data
sets, the Longley dataset which is
routinely applied by software reviewers
and the Wampler/Lauchli dataset.

Nash: There are also the Wampler
polynomial data sets.

a) Longley [6]

Scheunemeyer: This data has 7
independent variables (plus the
constant) for 16 time periods. The
dependent variables relate to
employment. The independent variables
are highly collinear and there is a
scaling problem.

Kolesar: The Longley set is good for
testing for these difficulties.

Scheunemeyer: Even extra years of data
do not improve the collinearity to an
appreciable extent.

b) Wampler polynomial least squares [10]

Nash: This data offers several problems
with increasing collinearity, though
they are not parametrized [8].

c) Wampler/Lauchli [4, 11]

Simon: The Wampler/Lauchli set is a
parametrized set whose regression
coefficients can be shown analytically
to be a column of 1's. By steadily
lowering a parameter epsilon towards
zero, you make the columns of the X
matrix (excluding the intercept)
increasingly collinear. The advantage
of this set is that you can examine a
package's performance with both
moderate and extreme examples of
ill-conditioning. Longley gives a
single extreme that may or may not be
representative of the data sets one is
likely to encounter. One criticism of
the W/L set is that it is artificially
generated (see Lesage and Simon,
[4]).
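A sketch of how such a parametrized benchmark might be generated and used follows. It assumes NumPy, and it assumes our reading of the definition in the Appendix, namely that X has n rows and n-1 columns (a column of ones, a first row of ones, and epsilon on the remaining diagonal). Under that reading the least squares coefficients are all exactly 1, so the largest deviation from 1 measures the accuracy lost. The function names are hypothetical.

```python
import numpy as np

def wampler_lauchli(n, eps):
    """Return (X, y) with X of shape (n, n-1), a bordered diagonal matrix."""
    X = np.zeros((n, n - 1))
    X[0, :] = 1.0                 # first row of ones
    X[:, 0] = 1.0                 # intercept column
    for k in range(1, n - 1):
        X[k, k] = eps             # epsilon on the interior diagonal
    y = np.full(n, eps)
    y[0] = (n - 1) + eps
    y[-1] = (n - 1) - eps
    return X, y

def fit_report(n, eps):
    """Condition number of X and worst coefficient error for one epsilon."""
    X, y = wampler_lauchli(n, eps)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.linalg.cond(X), np.max(np.abs(beta - 1.0))

for eps in (1e-1, 1e-3, 1e-5):
    cond, err = fit_report(8, eps)
    print(f"eps={eps:g}  cond(X)={cond:.3g}  max|b-1|={err:.2e}")
```

Stepping epsilon down in this way gives the moderate-to-extreme sequence of ill-conditioning that a single data set such as Longley's cannot provide.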

Lane:  Such  data  sets  are  mostly  useful 
for  testing  diagnostics. 


Several:  It  is  important  to  decide  what 
to  do  when  collinearity  is  diagnosed. 

Nash: In forecasting/prediction
applications we may not need the
coefficients so that a minimum length
least squares solution [8, p.17] may
be useful. This is equivalent to some
principal component solutions.
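To illustrate the minimum length solution, here is a small sketch assuming NumPy, whose `lstsq` routine returns the minimum-norm least squares solution when X is rank deficient. The data are made up: the third column is an exact copy of the second, so the coefficients are not unique, but the fitted values are.

```python
import numpy as np

# Made-up data: the third column duplicates the second, so X has rank 2.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(5), x, x])
y = 2.0 + 3.0 * x

# Minimum-length least squares solution (computed via the SVD).
b_min, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Another exact least squares solution: all the weight on one copy of x.
b_alt = np.array([2.0, 3.0, 0.0])

same_fit = np.allclose(X @ b_min, X @ b_alt)                 # identical forecasts
shorter = np.linalg.norm(b_min) < np.linalg.norm(b_alt)      # but a shorter vector
```

Both coefficient vectors give the same forecasts; the minimum length one simply splits the slope evenly between the two copies, which is also what regression on the principal components with nonzero singular values would produce.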

Scheunemeyer: The ridge regression
methods may also be reasonable
alternatives for both forecasts and
parameters.

Simon:  There  are  software  design 
questions  related  to  this  discussion. 

Most  packages  use  either  of  the 
following  choices: 

1. try to give the "best
possible" answers for any data
set (with a warning given to the
user when needed);

2. refuse to analyze any data
set with extreme
ill-conditioning.

Lee:  Choice  3  is  choice  (2)  with  remedial 
actions  suggested  by  the  program. 

Several: Choice (1) makes it too easy
for users to continue BUT also gives more
options to the informed user.

Nash: It is not widely recognized that
elimination methods ("sweeping") may
not flag rank-deficiencies. The
typical pivot tests are sufficient but
not necessary conditions. There are
some examples of matrices which appear
well behaved but are quite close to
being singular, for example, the Moler
matrix, [3, p.210].
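The Moler matrix makes the point concrete. The sketch below assumes NumPy and uses the usual definition of the matrix (diagonal element i, off-diagonal element min(i, j) - 2, indices starting at 1); the report itself does not spell the definition out, so treat it as our assumption.

```python
import numpy as np

def moler(n):
    """Moler matrix: A[i][i] = i and A[i][j] = min(i, j) - 2 (1-based indices)."""
    U = np.eye(n) + np.triu(-np.ones((n, n)), k=1)   # unit upper triangular, -1 above
    return U.T @ U

A = moler(10)
# For a symmetric positive definite matrix the elimination pivots (no
# interchanges) are the squares of the Cholesky diagonal.
pivots = np.diag(np.linalg.cholesky(A)) ** 2
smallest = np.linalg.eigvalsh(A)[0]      # smallest eigenvalue
condition = np.linalg.cond(A)

print("pivots:", pivots)
print("smallest eigenvalue:", smallest, " condition number:", condition)
```

Every pivot equals 1, so a pivot-size test sees nothing wrong, yet the smallest eigenvalue is tiny and the matrix is close to singular.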

Kolesar:  We  need  to  distinguish  special 
cases  versus  general  packages.  This 
could  involve  different  control 
parameters  for  the  software. 

Consensus:  1.  A  "long  pause"  is  needed 
when  collinearity  is  detected,  with 
questions  posed  to  a  user  which  require 
that  he/she  understand  the  consequences 
of  proceeding. 

2. Software should suggest remedial
action to overcome the collinearity.

Several:  What  about  judging  software? 

Ling:  There  are  questions  of  numerical 
versus  statistical  accuracy. 

Wang: Numerical accuracy is a function
of algorithm and precision available.

Simon: IF the algorithm and arithmetic
are correctly programmed.

Ling: NOT true, due to hardware and
data dependencies.

(Editor: This difference was not
totally resolved.)

Several: Statistical accuracy is
beta-hat "near" the center of the
distribution of possible estimates
given minor perturbations in the data?
This is related to the numerical
accuracy. [Readers should also note
the paper by O.W. Stewart elsewhere in
these proceedings.]
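One way to look at the "distribution of possible estimates given minor perturbations" is to jitter the data slightly and refit many times. The sketch below assumes NumPy. The relative perturbation size, the number of repetitions and the synthetic data are all arbitrary choices made for illustration, not anything proposed by the participants.

```python
import numpy as np

def coef_spread(X, y, rel=1e-4, reps=200, seed=0):
    """Std. dev. of each LS coefficient when every X entry is jittered by a
    relative error of at most `rel` (a stand-in for rounding in the data)."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(reps):
        Xp = X * (1.0 + rng.uniform(-rel, rel, size=X.shape))
        b, *_ = np.linalg.lstsq(Xp, y, rcond=None)
        fits.append(b)
    return np.std(fits, axis=0)

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2_indep = rng.normal(size=50)
x2_coll = x1 + 1e-3 * rng.normal(size=50)        # nearly a copy of x1
y = 1.0 + x1 + rng.normal(scale=0.1, size=50)

X_ok = np.column_stack([np.ones(50), x1, x2_indep])
X_bad = np.column_stack([np.ones(50), x1, x2_coll])
spread_ok = coef_spread(X_ok, y)
spread_bad = coef_spread(X_bad, y)
print("well-conditioned:", spread_ok)
print("collinear:       ", spread_bad)
```

With the collinear design the coefficients of the two nearly identical columns move far more under the same perturbation, even though each individual fit is numerically sound.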

Nash: Are there any parametrized data
sets which allow the difficulties
discussed to be tested AND
perturbations to be introduced?

Simon: We have some work in progress
[5].


PROBLEM: Marathon run times (courtesy
Roland Thomas, Carleton U.)

Problem type: estimation of
distributional form / parameters and
testing of various hypotheses.

Originators:  Roland  Thomas,  John  Nash 

Features: The times of all marathon
races run in Canada under an arbitrary
3 hour limit were recorded for several
years for both men and women. It is
desired to fit the distribution of
times for each sex separately to a
mixture model that might represent
competitive and recreational runners
within each separate sex group. The
underlying hypothesis is that, though
the "average" difference between the
sexes is of the order of 30 minutes,
the difference between "competitive"
times for males and females is much
less, perhaps closer to the actual
record differences, which are of the
order of 15 minutes. Since several
years of data are available, one may
also hope to observe the changes in
performance levels of both sexes.

There  are  particular  aspects  of  the 
problem  which  make  it  quite  difficult: 

1) the data set is quite large (5000+
observations each year)

2) there is no a priori model which
may be suggested other than the two
population mixture

3) the sample is censored by the
arbitrary 3 hour limit, which
eliminates a larger proportion of one
sex than the other. Furthermore, we
have no way to include runners who do
not finish. (Several participants
wanted to know if there was any
indication of the number of
starters.)

Commins:  Higher  participation  rates  by 
women  might  explain  relative 
improvement  in  performance  by  women. 

Suggestion by Innis Sande (Statistics
Canada) conveyed to participants via
moderator: Need a "marathon" effect to
account for differences in terrain and
weather.

Moderator briefly presented R. Thomas'
approach:

1) plot the cumulative distribution
of times for each race/sex

2) try a mixture of normal
distributions for 2 groups (elite,
recreational)

3)  use  a  (random)  subset  of  the  data 
for  preliminary  determination  of  the 
distributional  parameters  to  save 
computing  time. 

Sacher:  Is  it  really  necessary  to 
consider  a  random  subset,  since  a 
convenience  subset  (systematic 
sampling)  would  probably  suffice  for 
the  purpose? 
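Steps 2 and 3 of the approach, together with the systematic subset raised in the last question, might look as follows. This is only a sketch assuming NumPy: the times are synthetic (the real data are not reproduced here), the EM fitting routine is our choice of method rather than one named in the discussion, and the censoring at 3 hours is ignored entirely.

```python
import numpy as np

def fit_two_normal_mixture(t, iters=200):
    """EM for a two-component normal mixture; returns (weights, means, sds)."""
    w = np.array([0.5, 0.5])
    mu = np.quantile(t, [0.25, 0.75])            # crude starting values
    sd = np.array([t.std(), t.std()])
    for _ in range(iters):
        # E-step: responsibility of each component for each finishing time
        dens = w * np.exp(-0.5 * ((t[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted proportions, means and standard deviations
        nk = resp.sum(axis=0)
        w = nk / len(t)
        mu = (resp * t[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (t[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sd

# Synthetic finishing times in minutes (made-up parameters).
rng = np.random.default_rng(0)
times = np.concatenate([rng.normal(145, 5, 1500),     # "elite"
                        rng.normal(170, 5, 3500)])    # "recreational"
times = rng.permutation(times)

subset = times[::10]              # systematic sample: every 10th record
weights, means, sds = fit_two_normal_mixture(subset)
```

The subset estimates could then serve as starting values for a fit to the full data, which is where the saving in computing time would come from.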

PROBLEM:  Data  handling  and  presentation 

Problem  type:  data  manipulation, 
tabulation  and  graphing 

Originator:  Judd  Hampton,  Agriculture 


Features:  This  problem  involves  the 
handling  of  a  relatively  large  number 
of  variables  on  an  ongoing  basis,  and 
the  preparation  of  tables  and  graphs 
based  on  this  data  on  both  a  regular 
and  ad  hoc  basis. 

Background/client group: Marketing and
Economics Branch, Agriculture Canada,
produces quarterly Market Commentaries
for Grains and Oilseeds, Dairy,
Livestock, Horticulture and Special
Crops, Poultry and (consumer) Food
sectors. These commentaries report the
situation and outlook for each sector
and are used by producers, the
financial community, government at
different levels, agribusiness and
consumer agencies. Note that the
output requires accented
characters. (Many popular software
packages such as Lotus 1-2-3 cannot
easily be modified to allow graphical
or printed output with such
characters.)

Problem: The Statistical Analysis Group
of M&E Branch has the task of producing
detailed tables and graphs for the
commentaries which are used at the
annual (Agricultural) Outlook
conference. This involves
approximately 1500 camera-ready graphs,
which must contain accurate, up-to-date
information presented clearly in both
official languages in accordance with
strict editorial standards. The
publication deadlines are tight, and
often are close to the release date of
the source information.

Current  approach:  Originally, 
hand-drawn  graphs  and  typed  tables  were 
used.  In  the  early  1970s, 
Hewlett-Packard  desk-top  computers  were 
introduced  (9820  and  9830  series)  with 
plotters  for  graphical  output.  The 
quantity  of  output  was  such  that  one 
plotter  actually  wore  out  the 
potentiometer  slide  wire  (the  only  case 
the  HP  technicians  had  heard  of  in 
which this happened). Now HP 9845
series machines are used. Software was
created  to  maintain  and  update 
databases  containing  monthly  and  annual 
time  series  and  to  print  current  and 
past  data  and  five-year  averages.  A 
Multiwriter letter quality printer
produces the final tables for
photoreduction. HP flatbed plotters
are still used to produce high quality
plots. In some cases the multiple pen
capability is used not for colour but
for different pen widths. HP software
has generally not proved adequate to
meet production standards, and software
produced in-house is prepared as
needed.

Kolesar: For regular, i.e. routine,
use, special purpose programs are
likely to be the best choice.

Consensus: the decision to use a
special program should be governed by a
decision rule

E(no. of uses) × E(saving/use) <
Cost of preparing program
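The rule compares the expected total saving with the cost of preparing the program. A minimal sketch of the comparison follows, under our reading that a special program is worth writing only when the expected saving exceeds its cost. The function name and the figures are hypothetical, and the product of the two expectations equals the expected total saving only if the number of uses and the saving per use are independent.

```python
def special_program_pays_off(expected_uses, expected_saving_per_use, preparation_cost):
    """True when the expected total saving exceeds the cost of writing the program."""
    return expected_uses * expected_saving_per_use > preparation_cost

# Made-up figures, all in hours: a quarterly run for 5 years saving 3 hours
# each time, against 40 hours to write and test the program.
worth_it = special_program_pays_off(4 * 5, 3.0, 40.0)    # 60 hours saved vs 40
one_off = special_program_pays_off(1, 3.0, 40.0)         # 3 hours saved vs 40
```

For routine production work such as the quarterly commentaries the number of uses is large, which is why the special purpose route tends to win there.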

Lane: A session at the Prague COMPSTAT
meeting discussed such problems
[3, see also 1].


(Note: there was actually more
discussion of this problem, but the
major points raised are covered here.)


PROBLEM: Contingency table with ordered
categories

Originator: Roland Thomas (Carleton U.)

Features: A cross-tabulation shows a 2
state response against 3 categories of
one predictor and 5 of another -- a 2
by 3 by 5 table. One or more
predictors have categories which have
an ordering e.g. they represent the
ranges of a numerical variable which
are observed. How can such data be
analyzed efficiently while using the
ordinality present in the predictors?

Scheunemeyer: Try assigning weights to
the different categories...

Lee: Then use a logistic model on the
assigned weighting.
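The suggestion in the last two remarks (assign scores to the ordered categories, then fit a logistic model on the scores) can be sketched as below. This assumes NumPy and a grouped-binomial fit by iteratively reweighted least squares. The equally spaced scores, the cell totals and the coefficients used to build the table are invented for illustration, and the function name is hypothetical.

```python
import numpy as np

def fit_scored_logistic(successes, totals, scores_a, scores_b, iters=25):
    """Grouped-binomial logistic fit by IRLS with one slope per ordered
    predictor: logit(p_ab) = b0 + b1*scores_a[a] + b2*scores_b[b]."""
    A, B = np.meshgrid(scores_a, scores_b, indexing="ij")
    X = np.column_stack([np.ones(A.size), A.ravel(), B.ravel()])
    y, n = successes.ravel(), totals.ravel()
    beta = np.zeros(3)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = n * p * (1.0 - p)                    # binomial working weights
        z = X @ beta + (y - n * p) / W           # working response
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

# A 3 by 5 table of binomial cells (the 2-state response gives the third
# dimension), built from made-up coefficients so the fit can be checked.
scores_a = np.array([1.0, 2.0, 3.0])
scores_b = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
true_beta = np.array([-1.0, 0.5, -0.3])
A, B = np.meshgrid(scores_a, scores_b, indexing="ij")
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * A + true_beta[2] * B)))
totals = np.full((3, 5), 100.0)
successes = totals * p_true                      # expected counts, no sampling noise

beta_hat = fit_scored_logistic(successes, totals, scores_a, scores_b)
```

The scored model spends one parameter per predictor instead of one per category level, so the choice of scores carries real weight and should be examined rather than taken for granted.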

Lane: Chapter ?? in McCullagh and
Nelder [7] discusses this model.
They parametrize the proportion of
response as one cumulates through the
(ordered) categories.

Lee: One has to choose particular
break-points in a variable to give
appropriate categories.

Scheunemeyer: We are trying to force
ordered categories into a continuous
case.

Dumas: SAS GSK command can be used to
perform weighted least squares.
Another approach is GENCAT (Landis,
1974??). Brown (BMDP) is writing a new
code to ... ??

Commins:  If  we  get  a  "good"  fit  without 
using  the  ordering,  should  we  continue 
our  analysis? 

Jennings: Scaled model has one
parameter per variable, while the
independence model has one per level of
variable. If the contingency table
indicates independence, we may wish to
continue analysis.

Simon: Conover [2, pp. 232-234,
335-338 and problems 3 and 4 on p.386]
mentions using ranks in a contingency
table with an ordinal category. This
approach relies heavily on average
ranks.


At this point, despite fairly active
discussion, the moderator had to bring
the session to a close.
APPENDIX

The Wampler/Lauchli dataset

\[
Y_i = \begin{cases}
n - 1 + \epsilon, & i = 1 \\
\epsilon, & i = 2, \ldots, n-1 \\
n - 1 - \epsilon, & i = n
\end{cases}
\]

\[
X_{ij} = \begin{cases}
1, & i = 1,\ j = 1, \ldots, n-1 \\
1, & j = 1,\ i = 1, \ldots, n \\
\epsilon, & i = j = 2, \ldots, n-1 \\
0, & \text{otherwise}
\end{cases}
\]

(X is a bordered diagonal matrix)
ESSENTIAL INGREDIENTS FOR A STATISTICAL WORKSTATION


Thomas J. Boardman
Fort Collins, Colorado


In the future engineers, scientists, and other professionals will perform many of
their work assignments on computer workstations. In part, the renewed interest in
statistical methods as one tool for helping industry and government improve the
quality of goods and services justifies the need for statistical components in the
workstation. Some design objectives for workstations are discussed in order to lead
into a discussion of the necessary hardware and software ingredients for workstations.
One scenario is proposed by describing how the statistical functionalities on a
workstation might appear to the user if the hardware has a bit mapped screen similar to
the Apple Macintosh. Finally several challenges for the future are described which
offer encouragement for improvements in statistical software in the future.


1. INTRODUCTION

My intention is to challenge your thinking about
how people may use statistical methods in the
future, perhaps even the way people will first
learn about statistical methods. You might ask
why should workstations have statistical
components? Let us start back at the beginning and
discuss the increased interest in computing. I
met Bruce Woolbert, of Hewlett Packard's Personal
Computer Division, at the Pharmaceutical
Manufacturers Association Biostatistics
Subsection Annual Meeting, held in San Francisco
in October 1984. He and I had been asked to
address the conference. Hewlett Packard
authorized a firm to do market research for them.
Bruce Woolbert reported some of the results
during his presentation. One statistic he
reported is that one in thirteen office
professionals is currently using computers in
his/her job function. He went on to say that we
are increasingly seeing new uses for personal
computers. People are finding there are ways
that they can use computers that they had not
even considered in the past. For example,
networking of computing systems will be much more
popular in the future. More about this topic
later. In fashion at the moment are ideas for
using computers in new ways such as
computer-aided design, computer-aided engineering,
computer-aided manufacturing, and
computer-aided office. All of these reflect the market's
movement toward integrated systems.

Computers  are  used  from  the  bottom  up.  By  that 
we  mean  that  computers  are  now  used  all  the  way 
from  secondary  education  through  college.  Their 
availability in education certainly has an
effect on what we are doing in our course work in
higher  education.  Within  the  last  couple  of 
years  I  have  seen  considerable  changes  in  my 
own  department  in  terms  of  the  quality  and 
types  of  computing  that  we  are  doing  in  our 
course  work. 


Computers are also used in industry and in
many of our homes. Furthermore I suspect that
computers will be in a great number of the
graduate students' homes after they complete
their studies. So we can say that computers
are for you and me and the kids. The pressure
on adults from youngsters using computers will
force adults to think about how computing
actually needs to be done. Bruce Woolbert made
an estimate based on the same research that
Hewlett Packard had commissioned. He reported
that by 1995, 65% of ALL office workers will be
using computers. Are the adults ready for that
magnitude of commitment?

Interest in statistical methods has been
generated by the renewed interest in quality and
productivity. Competition from Japan and other
countries has awakened U.S. industry to the
fact that statistical methods can help improve
processes. Of course statistical methods are
only a part of the quality improvement efforts
and processes do not involve just goods. Some
estimates show that in excess of 85% of all
employees are actually in the service area.
There are many opportunities for improving
quality and therefore productivity in the
service area.

The new emphasis on quality is affecting the
way management deals with their employees.
There is a new awareness of the employees' roles:
to know their job, and to get their job done
more effectively. The annual National Quality
Month is one indication from Congress and the
President of the importance of this area.
Other activities such as the American Statistical
Association's Committee on Quality and
Productivity are of course concerned with smaller
audiences but still show some commitment from
ASA.

The software business is booming. The
"Directory of Software for Quality Assurance/Quality
Control" in the March 1985 issue of Quality
Progress listed 118 packages. Almost all of
them have some statistical components. Quite a
few of the packages are strictly statistical
packages such as SPSS, MINITAB, etc. What does
this mean? There is a renewed interest in
statistical computing. The proliferation of
statistical software is important because it is
another sign of the beginning of the
understanding that statisticians and, more importantly,
statistical methods can really help. W. Edwards
Deming says that American management has to
change. Even though it is only a small part of
the transformation process, the use of
statistical methods is nevertheless part of the
process. Management is faced with making
meaningful decisions in the face of uncertainty
and variation. Using the scientific method to
get meaningful information upon which to base
some of their decisions is beginning to be
recognized as a valid approach. Statisticians
and statistical methods can help managers make
decisions in a scientific manner.

2.  WORKSTATIONS 

Why  should  the  statistical  computations  be 
implemented  in  a  workstation  environment? 

There are a couple of key points here. The
order is not important. One is the
proliferation of microcomputers. I do not have recent
estimates of the number of microcomputers at
CSU but I suspect that it is probably upwards
of a thousand at this point. In the spring of
1984 the estimate was in the neighborhood of
400 with new orders at about 80 a month.
Unfortunately our statistics department is not
expanding in microcomputers as rapidly.
Nevertheless the growth is dramatic.

Another  reason  why  a  workstation  environment 
makes  sense  is  that  people  who  use  computers 
have  more  than  one  task  to  do.  Although  they 
tend  to  be  focused  around  one  speciality, 
computer  users  find  themselves  using  word 
processors,  statistical  packages,  and  wanting 
to  do  lots  of  different  tasks  at  a  computer. 

The idea behind a workstation is to put
together all of the tools necessary to help a user
perform any number of tasks. Thus it is an
important design concept to make workstations
simple for workers to use. Workstations are
being looked at as an effective way to get the
job done. Computers are not a substitute for
good hard thinking or good creative work. Dr.
Deming discusses what he calls "instant pudding
solutions"; that is, any solution to a problem
that is easy (not necessarily cheap, but easy to
do). Some look at computers as being an effective
way to make better quality products and to
increase people's quality and productivity.

Deming  is  convinced  computers  will  not  replace 
good  and  creative  thinking.  Workstations 
should  be  viewed  by  management  as  one  potential 
tool  for  improving  the  quality  and  productivity 
of  the  workers.  Of  course,  the  cost  versus 
benefit  of  using  any  tool  must  be  evaluated. 

As someone said, if the only tool you have is a
hammer, it is surprising how many problems look
like nails.

Another justification for workstations involves
the concept of networking of computing resources.
The networking concept involves more than just
sharing computer peripherals such as printers
and plotters. Ideally networks of computing
devices will free the user from having to make
decisions about which computer offers the
proper environment for today's tasks. A
network system should provide simple ways for
users to communicate with many computers without
having to know many different protocols.
Consider my own situation. Currently I am working
on the IBM XT in the Stat. Lab., I have a
Macintosh at home, I use the CSU CDC CYBER
mainframe computer for many statistical
applications, I am on the Engineering College
collection of VAX's and I recently tried SAS on
the Vet Hospital's Data General. It is mind
boggling to try to remember all the different
protocols to get on all of these machines. One
potential advantage for a network environment
at CSU is that interfacing to the various
computers could be much simpler. You would not
have to remember anything except the protocol
for the one machine you prefer to use. The
computer network would interface to all the
others. If one machine needs a caret C or
whatever, the network systems could remember
that and take care of it for you.

Finally workstation environments will abound
because the suppliers of these systems will
convince us through advertisement that we cannot
do without their systems. This reason may
actually dominate all the others. Why?
Because software vendors are going to make a lot
of money on workstation software. Vendors are
just beginning to push the concept of integrated
packages. The workstation environment is a
step beyond several integrated packages. In
this environment almost all tasks which we would
like to do on a computer are "integrated" together.


What are some of the essential components of a
workstation? Consider the following three
categories: the design objectives for a workstation,
the  necessary  hardware  ingredients,  and  the 
necessary  software  ingredients.  At  a  conference 
sponsored  by  SIGNUM  of  ACM  in  March  1984  I  heard 
a  presentation  by  John  K.  Wooten  of  the  Computing 
Division  of  Los  Alamos  National  Laboratory.  His 
talk  touched  on  the  first  two  areas  above. 
Blending  my  experience  as  a  project  investigator 
and  consultant  for  Hewlett  Packard  with  recent 
visits  to  AT&T  Bell  Laboratories,  discussion 
with  those  at  previous  Interface  Conferences, 
reading  articles  on  the  topic  and  considerable 
thinking,  I  have  prepared  the  following  lists 
under  the  three  categories  mentioned  above. 

Consider  first  the  design  objectives  which  an 
organization  should  have  when  considering  how  a 
workstation  should  ideally  be  used. 


Design Objectives
for a Workstation Environment

To be most effective, workstations should
be used throughout an organization.

The interface must be user friendly.

Certainly job specific software will be
needed and must be available shortly after
introduction to the worker.

Good response time is essential.

The hardware and software must be
compatible with other equipment already
in place.

The hardware and software must be
expandable and upgradeable as new developments
come on line.


Inexpensive  printer  close  by  and  peripherals 
such  as  a  laser  printer,  hard  disk  with 
large  storage,  plotters  connected  to  your 
phone  for  all  forms  of  communications 

The next list gives the software characteristics
which should be designed into a workstation
system. Although each of the items could be
described in great detail, this will not be done
for two reasons. First, since most readers of
this paper will have a general idea of what is
meant by each of the characteristics, the author
does not intend to create an argument on semantics.
Secondly, all of our definitions are likely
to change as we view new approaches to software
development. Therefore this list is merely
included to suggest the general characteristics
which should be considered.

Software  Characteristics 
For  A  Workstation  System 


The user must be able to program in one or
more  languages  but  the  user  should  not 
have  to  program  to  use  the  equipment. 

There must be software for office automation
such as: word, text, and composition
processors; file organizer; information
retrieval system interface;
electronic mail; inventory control
modules; data communication links; data
base management systems; graphics
presentations; ledger analysis packages;
etc.

Since many different data bases exist in
an  organization  the  system  must  be  able  to 
access  them. 

Through  network  environments  or  whatever, 
one  should  be  able  to  share  resources  such 
as  peripherals. 

The list of hardware ingredients which follows
may  be  lacking.  I  do  not  claim  any  particular 
wisdom  here.  Then  too  if  we  wait  a  week  or  two 
the  list  will  probably  change. 

Some  Hardware  Ingredients 
For  A  Workstation 

Full  Bit-Mapped  Screen  of  sufficient  size 
to  be  read  more  than  a  foot  or  two  away 

Good  Resolution,  color  graphics  both  on 
the  screen  and  a  graphics  output  device 
(may  be  at  a  remote  site) 

A  simple  keyboard 

A  cursor  control  device  such  as  a  mouse 
Multiwindow  screen  capability 
Considerable RAM, perhaps 1 to 1.5 megabytes
Considerable  local  storage,  10-20  megabytes 


The  operation  must  appear  to  the  user  to  be 
FRIENDLY. 

The  system  should  appear  to  the  user  to  Do 
Harder  Tasks  Simply. 

The  system  ought  to  remember  what  has  been 
done  before  using  what  is  often  referred 
to  as  Intrlp  System. 

To the extent that it is needed, Help
Features should be available.

There ought to be effective ways to Allow
the Sophisticated User to Move Quickly
Inside the System.

The  system  should  provide  for  Repeatable 
Work  with  minimal  user  specification. 

As  the  science  of  Artificial  Intelligence 
develops  the  workstation  system  should  in¬ 
corporate  some  of  the  better  features. 

The  workstation  system  should  provide  for 
Multitasking  both  in  the  CPU  and  on  the 
display. 

The  user  should  be  able  to  develop  User 
Specified  Procedures/Routines  which  can  be 
called up in the future.

3.  SCENARIO  OF  RESEARCH  ON  A  WORKSTATION 

Consider  for  a  moment  how  an  engineer  or  a 
scientist  might  use  a  workstation  environment  to 
solve a problem. The notion to keep in mind is
that  the  tasks  which  I  am  describing  can  be 
performed  at  one  station.  The  researcher  is 
sitting  at  his/her  desk.  The  researcher  has 
been  confronted  with  a  problem.  The  first  thing 
you  might  want  to  think  about  is  to  formulate 
the  initial  concepts  associated  with  a  possible 
solution, organize and develop ideas, and save
those things for future use. (See Exhibit 1 for
a list of tasks and workstation tools to be used.)


You  might  use  a  word  and  text  processor  and  an 
idea  processor.  You  will  need  a  file  organizer 
to  save  all  the  ideas  for  the  next  steps. 

Before  you  chance  reinventing  the  proverbial 
wheel  you  might  want  to  perform  a  literature 
search  using  one  of  the  several  available  in¬ 
formation  retrieval  systems.  Once  the  search 
is  complete  the  results  will  be  saved.  At  this 
point  you  should  be  ready  to  formulate  the 
proposed  research  objectives  and  prepare  a  draft 
including  the  budget  and  other  financial  impli¬ 
cations.  The  tools  involved  in  this  step  are 
word and text processors, a financial modeler,
and  a  spread  sheet  package. 

The draft is submitted via electronic mail for
peer  evaluation  followed  by  a  possible  revision. 
Once  approval  has  been  obtained  it  is  necessary 
to  check  on  the  availability  of  the  equipment 
and  supplies  to  be  used  in  the  experiment.  If 
this  information  is  not  immediately  at  hand  one 
could use the inventory control and order
processing components of the workstation. Indeed,
since others may wish to use the equipment,
the requirements should be noted through the
network environment so others will not make claim
on the equipment.

Using  an  experiment  design  package  the  researcher 
is  assisted  in  making  final  decisions  about  which 
factors  to  use,  the  levels  of  those  factors,  and 
the type of design to be run. The hardware is
interfaced  with  the  appropriate  instrumentation, 
the  order  of  the  experimental  design  is  randomized 
and  the  experiment  is  performed. 
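
The randomization step described above can be sketched in a few lines. This is only an illustration, assuming a full factorial design; the function name `randomized_run_order` and the factor names are hypothetical and do not come from any particular experiment design package.

```python
import itertools
import random


def randomized_run_order(factors, seed=None):
    """Return all runs of a full factorial design in random order.

    `factors` maps a factor name to the list of its levels (illustrative
    structure, assumed for this sketch).
    """
    names = list(factors)
    # Enumerate every combination of factor levels.
    runs = [dict(zip(names, combo))
            for combo in itertools.product(*(factors[n] for n in names))]
    # Shuffle with a private generator so a seed reproduces the order.
    rng = random.Random(seed)
    rng.shuffle(runs)
    return runs


if __name__ == "__main__":
    design = {"temperature": [150, 175, 200], "pressure": ["low", "high"]}
    for i, run in enumerate(randomized_run_order(design, seed=1985), start=1):
        print(i, run)
```

Passing a seed makes the run order reproducible, which fits the later requirement that sessions be repeatable.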

We  should  mention  here  that  at  several  of  these 
steps  we  do  not  expect  immediate  response  from 
the  system.  For  example,  the  time  involved  to 
complete  all  experimental  runs  may  be  several 
days or weeks. The important thing to remember
though  is  that  we  can  expect  that  the  user  at 
his  workstation  will  be  receiving  information, 
when  appropriate,  on  the  progress  of  the  experi¬ 
mentation. 

The data as received are stored in a data base
management system and verification procedures
are performed continuously. At various stages
the meta data are input to the data base
management system. Meta data are essentially
all  the  non-numerical  information  associated 
with  the  data  base  that  you  would  like  to 
remember.  Everything  you  might  record  in  a  lab 
book  which  normally  gets  lost  when  you  input  the 
results  to  the  computer  can  be  saved  as  meta 
data. 
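
One way to picture meta data kept alongside the numbers is sketched below. The record layout, the `Measurement` class, and the `annotate` helper are assumptions made for illustration only; they are not taken from any actual data base management system.

```python
from dataclasses import dataclass, field


@dataclass
class Measurement:
    """One numeric result plus the lab-book notes that belong to it."""
    run: int
    value: float
    meta: dict = field(default_factory=dict)  # free-form, non-numerical notes


def annotate(measurement, **notes):
    """Attach lab-book style notes to a stored measurement."""
    measurement.meta.update(notes)
    return measurement


readings = [Measurement(1, 4.2), Measurement(2, 3.9)]
annotate(readings[1], operator="analyst A", note="sample jar cracked")

# The notes travel with the data, so suspect runs can be found later.
flagged = [m.run for m in readings if "note" in m.meta]
print(flagged)
```

The point of the sketch is only that the notes are stored with the value they describe, rather than in a separate lab book.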

The  researcher  completes  the  various  data  manipu¬ 
lation  operations  such  as  handling  missing  values, 
transformations,  sorting,  merging,  etc.  most 
likely  in  a  data  base  management  system.  At  this 
point  you  are  ready  for  appropriate  statistical 
analyses including exploratory analyses on the
data.  There  could  be  many  steps  involved  here. 

The  process  should  be  iterative  and  augmented 


with  statistical  graphics  as  well. 

After  completing  the  analyses  the  researcher 
will  need  to  prepare  some  graphics  for  presen¬ 
tation  of  the  results.  These  can  be  done  in  the 
statistical  graphics  package  or  perhaps  in  a 
graphics  presentation  package  specifically 
designed  for  high  resolution  graphics.  A  ledger 
analysis  or  spread  sheet  package  can  be  used  for 
summarizing the final accounting for the report.
We  complete  the  written  report  on  a  word  and  text 
processor,  develop  slides  for  presentation  of  the 
results,  and  give  the  oral  report  to  management 
throughout  the  corporation.  The  presentation 
may be a real time "dog and pony" show on the CRT
screen  to  the  various  managers  and  colleagues 
who  need  to  know  the  results. 

Finally  the  researcher  saves  all  of  the  results 
in a file organizer for future reference. Subsequently
the researcher reads his mail and
discovers  a  new  project  awaiting  him.  Or  per¬ 
haps  the  previous  project  needs  to  be  studied 
under  new  conditions.  The  point  is,  of  course, 
that  the  workstation  environment  can  perform  a 
myriad  of  tasks — all  accomplished  at  one  loca¬ 
tion. Note also that only a few of the tasks
involve  statistical  operations.  The  workstation 
environment  must  allow  a  complex  array  of  tasks 
to  be  performed.  From  the  user's  point  of  view 
the  operation  should  appear  to  be  blended  togeth¬ 
er.  The  resources  used  in  one  task  should  be 
available  to  other  portions  without  great  effort 
on  the  part  of  the  user. 

4.  ADDITIONAL  FEATURES  FOR  THE  STATISTICAL 
COMPONENTS 

There  are  a  few  specific  additional  features 
which  should  be  part  of  the  statistical  compo¬ 
nents  of  a  workstation.  These  are  in  addition 
to  those  software  characteristics  discussed  in 
section  2.  The  software  must  be  user  friendly 
both  for  the  beginner  and  the  experienced  user. 
Many  will  experience  their  first  use  of  sta¬ 
tistical  methods  in  a  workstation  environment. 

It  is  therefore  important  that  their  experience 
with  statistical  analyses  be  friendly.  By  this 
we  mean  that  at  whatever  level  of  complexity, 
the  operation  of  the  workstation  should  appear 
to  be  straightforward. 

Of  course  we  expect  that  the  statistical  compo¬ 
nents  should  offer  comprehensive  and  complete 
solutions  for  the  task  selected.  The  software 
should  be  powerful.  The  statistical  analyses 
should  cover  a  wide  range  of  types  of  situations. 
And  of  course  we  expect  that  the  results  should 
be  correct  statistically  and  numerically. 

Three  special  operations  are  quite  important  for 
the  statistical  components  in  a  workstation 
environment.  First,  the  system  should  allow  the 
user  to  branch  back  up  through  the  path  of  the 
analysis  and  choose  another  route.  The  system 
must  remember  what  has  been  done  before  and 
allow  the  user  to  try  new  routines.  Secondly, 


the  system  should  offer  repeatable  sessions  in 
which  the  user  can  request  similar  paths  through 
the  analysis  with  perhaps  a  different  selection 
of  variables  and/or  subsets.  And  third,  the 
system  should  allow  the  user  to  customize  his 
or  her  own  steps  through  the  data  analysis. 

The  sequence  of  operations  and  decisions  which 
are  made  could  be  given  a  procedure  name  and 
requested  subsequently  by  that  name. 
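
The three operations just described (backing up along the analysis path, repeating a session on other data, and naming a user-defined procedure) can be sketched as a small session log. The class `AnalysisSession` and its method names are hypothetical, chosen only to make the idea concrete.

```python
class AnalysisSession:
    """Remembers the steps of an analysis so they can be undone or replayed."""

    def __init__(self):
        self.steps = []        # (label, function, keyword arguments) in order
        self.procedures = {}   # named copies of step sequences

    def apply(self, label, func, **kwargs):
        """Record one analysis step."""
        self.steps.append((label, func, kwargs))

    def back_up(self, n=1):
        """Drop the last n steps so another route can be tried."""
        if n > 0:
            del self.steps[-n:]

    def save_procedure(self, name):
        """Give the current sequence of steps a procedure name."""
        self.procedures[name] = list(self.steps)

    def run(self, data, name=None):
        """Replay the current path, or a named procedure, on `data`."""
        steps = self.procedures[name] if name else self.steps
        for _, func, kwargs in steps:
            data = func(data, **kwargs)
        return data


def drop_missing(xs):
    return [x for x in xs if x is not None]


def scale(xs, factor):
    return [x * factor for x in xs]


def mean(xs):
    return sum(xs) / len(xs)


session = AnalysisSession()
session.apply("drop missing", drop_missing)
session.apply("scale", scale, factor=10)
session.back_up()                      # change of mind: try another scaling
session.apply("scale", scale, factor=2)
session.apply("mean", mean)
session.save_procedure("cleaned_mean")
print(session.run([1, None, 3], "cleaned_mean"))
```

Storing the steps rather than the results is the design choice that makes the same named procedure reusable on a different selection of variables or subsets.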

Finally  it  is  imperative  that  the  statistical 
and  other  components  incorporate  graphics  into 
every  segment  of  the  routines.  In  particular 
the statistical components should have graphics
which  are  fine-tuned  to  the  analyses  and  inte¬ 
grated  into  all  components  of  the  software. 
Furthermore,  the  user  interface  should  most 
likely  be  graphical  in  nature  with  pull-down 
menus,  pop-out  windows,  etc.  In  "The  Visual 
Mind  and  the  Macintosh",  Benzon  [1]  describes 
why  he  believes  the  visual  mind  is  now  recog¬ 
nized  as  being  so  important  in  user  interfacing. 
While  his  article  focuses  on  the  Apple  Macintosh 
computer,  most  of  his  remarks  would  also  apply 
to  other  operating  systems  and  software  as  well. 
Indeed  those  vendors  and  software  developers 
who  do  not  make  use  of  the  left  side/right  side 
characteristics  of  the  brain  are  missing  an 
important  way  to  interface  with  the  user. 

5.  ONE  POSSIBLE  SCREEN  IMPLEMENTATION 

Let  us  consider  how  the  user  interface  for  the 
statistical  components  might  be  implemented  in 
an  integrated  workstation.  The  hardware  will 
have to include a high resolution screen. A
color  screen  would  be  helpful  but  is  not  es¬ 
sential.  We  will  need  to  control  the  cursor 
with  either  a  mouse  or  some  other  type  of  control¬ 
ler.  The  mouse  is  my  preference  at  this  time. 

There  are  several  characteristics  of  the 
operating  environment  to  be  mentioned.  The 
user  should  not  have  to  remember  a  lot  of 
commands  to  start  the  system.  The  start-up 
sequence  is  often  a  frustrating  exercise  for 
most  novice  users.  Of  course  it  can  be  frus¬ 
trating even to the experienced user if he/she
must  remember  the  sequence  for  several  different 
machines.  Typically,  smaller  machines  are 
easier  to  use,  but  this  is  not  always  true. 

We  want  to  have  simple,  easy-to-understand 
displays.  In  some  of  the  packages  we  have  been 
evaluating  at  CSU  the  initial  display  is  very 
difficult  to  understand.  The  operations  ought 
to  be  simple  enough  so  that  the  user  does  not 
need  a  manual.  We  like  pull-down  menus  which 
lead  to  pop-out  menus.  We  have  discovered  that 
most  users  like  to  have  the  ability  to  fill  in 
the  answer  blanks  on  a  screen.  One  must  have  a 
fairly  sophisticated  computer  to  be  able  to  move 
a  cursor  around,  fill  in  answers,  and/or  check 
off  various  options.  Up  to  this  point  the 
argument  has  been  that  you  need  to  provide  a 
command language interface for the more sophisticated
users. After all, the story goes, these


users  will  want  to  move  around  rapidly  in  this 
software.  The  paper  by  Velleman  and  Lekowitz 
in  these  Proceedings,  however,  suggests  that 
even  sophisticated  users  can  use  a  mouse  inter¬ 
face  more  efficiently  than  a  command  language 
approach.  More  research  needs  to  be  conducted 
on  this  topic  but  the  preliminary  results  are 
encouraging. 

The  ability  to  view  and  operate  on  multi-windows 
on  the  screen  is  essential.  The  windows  will 
naturally  overlap.  Thus  many  different  events 
can  be  shown  on  the  screen  at  the  same  time.  We 
should  be  able  to  page  through  the  windows.  A 
data  window  will  more  than  likely  contain  more 
information  than  can  reasonably  be  displayed  on 
the  screen.  Paging  is  essential. 

The  system  ought  to  support  multi-processing 
which  is  visible  on  the  screen.  For  example, 
the  results  of  an  analysis  might  be  displayed 
in  one  window  while  the  user  cycles  through  the 
data  in  another  window,  and  a  scattergram  is 
created  in  another  window.  In  addition  we  might 
expect  background  processing  to  occur  while 
other  operations  are  displayed  on  the  screen. 

Other  features  include  a  help  operation  without 
tears.  On  some  systems  once  you  enter  the  help 
sequence  the  system  essentially  sets  aside  the 
current  operations  and  branches  to  some  other 
part of the program. You may have to recycle
through  the  entire  operation  again  to  get  back 
to  where  you  were.  Another  feature  which  has 
already  been  mentioned  would  allow  one  to  back 
up  the  steps  in  the  analysis,  make  changes,  and 
start  down  another  path.  Finally  we  require 
user  defined  routines.  The  system  should  allow 
us  to  specify  a  particular  sequence  of  events 
and  identify  that  sequence  as  ours. 

During  the  oral  presentation  of  this  topic,  the 
author  mentioned  that  because  he  has  been  associ¬ 
ated  with  a  project  with  Hewlett  Packard  that 
is  not  complete  at  this  point  and  his  wife  is  a 
consultant  for  IBM,  he  decided  to  "show"  one 
possible implementation on a Macintosh computer,
the computer they have at home. All of the
characteristics  such  as  pull-down  menus  and 
pop-out  menus  were  described  during  the  oral 
presentation.  Another  concept  which  was 
described  is  an  event  window  at  the  bottom  of 
the  screen.  In  this  window  events  are  displayed 
such  as  the  time,  date,  elapsed  time  for 
certain  events,  busy  signal  for  disk  I/O,  and 
status  of  operations  such  as  multiprocessing  of 
computations  with  nonlinear  regression.  In  ad¬ 
dition,  the  author  described  several  other 
features  such  as  how  one  might  scroll  through 
various  windows,  enlarge  or  shrink  windows,  and 
telescope  or  magnify  portions  of  a  window. 

6.  CHALLENGES  FOR  THE  FUTURE 

The  concern  has  been  raised  that  many  people  in 
the  future  may  first  learn  about  statistical 
methods  on  a  workstation.  Their  first  exposure 
to comprehensive sets of statistical tools may
be  when  they  sit  down  at  their  workstation. 

The  idea  is  a  little  frightening.  Assuming  that 
people  will  first  encounter  statistical  methods 
this  way,  it  means  that  the  quality  of  the  sta¬ 
tistical  software  is  paramount.  The  software 
developers  have  a  more  important  responsibility 
in  workstation  environments.  And  statisticians 
have  a  responsibility  to  make  sure  that  the 
software  developers  produce  quality  products. 

Some  might  question  why  statisticians  should  be 
involved;  after  all,  computer  scientists  will 
more  than  likely  be  doing  the  software  develop¬ 
ment.  Sure,  the  computer  scientist  will  design 
the  systems  and  they  ought  to  do  so.  However, 
statisticians  should  help  the  computer  scientist 
in  the  following  areas:  in  defining  the  depth, 
breadth,  and  completeness  of  the  statistical 
coverage;  determining  the  algorithms  to  be  used 
for  the  computation;  reviewing  the  user  inter¬ 
face  with  regard  to  at  least  the  terminology 
used;  supplying  the  test  data  sets;  and  evalu¬ 
ating  the  overall  performance  of  the  software. 

We  must  realize  that  there  will  be  new  forms  of 
statistical  software  in  the  future.  One  can 
speculate  that  the  way  computer  packages  now 
interact with the user will be considered "old
fashioned" in three years or less.

Statistical  methods  in  the  future  will  be 
changing  as  well.  Large  data  sets  will  be  more 
prevalent.  Rapid  arrival,  on-line  data  col¬ 
lection  will  be  commonplace.  New  types  of  data 
analyses  to  accommodate  large  multivariate  data 
sets  will  be  needed.  We  will  no  longer  be 
satisfied  with  simply  giving  our  clients 
analyses  of  variance  tables.  They  will  need 
and  expect  far  more  from  the  statistics  packages. 

Many  believe  that  there  will  be  a  dramatic 
emphasis  on  the  use  of  good  and  insightful  sta¬ 
tistical  graphics.  Certainly  the  hardware  can 
display  the  graphics.  It  will  be  up  to  the 
developers  to  integrate  statistical  graphics 
throughout  the  routines.  The  American  Sta¬ 
tistical  Association  will  shortly  have  a  new 
section of Statistical Graphics. The statistics
profession  obviously  feels  graphics  are  impor¬ 
tant.  Statisticians  have  a  chance  to  "show" 
users  of  our  methodology  that  statistics  really 
can  help.  Statistical  graphics  may  be  our  best 
tool. 

The  supercomputers  with  various  forms  of  parallel 
processors  may  indeed  change  the  type  of  problems 
we  consider  to  be  statistical  in  nature.  This 
topic  should  receive  more  attention  at  the  future 
Interface  Conferences. 

Finally,  one  subject  for  the  future  which  may 
affect  how  users  do  statistical  methods  is 
Artificial Intelligence (AI). After asking
several people for their definitions of AI and
receiving somewhat different answers from each,
someone finally said that a program which could
recognize information never specifically programmed
and draw inferences and conclusions would


have  AI  features.  While  the  concept  may  be 
plausible, the reality may make us wonder. Is
it possible to encode the knowledge systems of
a brilliant statistician such as John Tukey into
software so that the user will have the benefit
of Tukey's help on the user's problem (smart
system)? And can the system carry the process
further  so  that  even  though  the  smart  system 
has  not  seen  the  user's  problem  before  it  will 
lead  the  user  through  decision-making  processes? 

Wow!

7.  CONCLUSIONS 

Small computers will be increasingly involved
in all aspects of our lives. Our children will
begin  learning  how  to  use  computers  in  elemen¬ 
tary  school  and  can  reasonably  expect  to  use 
them  throughout  their  lives.  Employees  will  use 
computers  and  computer  technology  on  the  job. 
Indeed many employers may install identical
computers  in  their  employees'  homes  so  that  they 
can  follow  up  on  good  ideas  even  while  they  are 
at  home.  Whether  supplied  by  their  employer  or 
not,  most  homes  in  the  future  will  use  computers 
for  a  wide  range  of  tasks.  We  can  expect  that 
many  tasks  have  not  even  been  envisioned  today. 
The  computer  revolution  may  not  have  even  ar¬ 
rived.  Perhaps  we  are  only  at  the  dawn  of  the 
revolution.  We  do  not  fully  appreciate  the 
place  which  computers  will  have  in  societies  of 
the future.

There  is  every  indication  that  statistical 
methods  will  be  even  more  important  in  the 
future.  The  renewed  emphasis  on  improving 
quality  and  productivity  is  helping.  Because 
everything  we  do  can  be  thought  of  as  a  process 
which  needs  continuous  improvement,  recog¬ 
nition  of  the  proper  use  of  statistical  methods 
should  increase  greatly.  These  are  great  times 
for  statisticians.  Statistical  software  will 
continue  to  proliferate  and  change.  Some  feel 
that  the  software  developers  may  help  to  lead 
statistical methods into the 21st century.

Workstations  are  but  one  result  of  high  tech¬ 
nology  which  should  affect  our  lives  in  a 
positive  way.  The  statistical  components  in¬ 
corporated  in  these  workstations  would  be  impres¬ 
sive.  Statisticians  need  to  get  involved  to 
make  sure  that  this  happens. 

Exhibit 1

Scenario  of  Research  Done  on  a  Workstation 

Step  Tasks                        Workstation Tools

 1.   Formulate initial            Word & Text Processor
      concept, organize            (WTP), Idea Processor
      & develop ideas,             (IP) & File Organizer
      & save for future            (FO)

 2.   Perform literature           Information Retrieval
      search & save                Systems and FO

 3.   Formulate proposed           WTP, Financial
      research objectives          Modeler & Spreadsheet
      & prepare draft,             package
      including financial
      implications & budget

 4.   Submit to peers for          Electronic Mail &
      evaluation, criticism        WTP
      & revise

 5.   Check on availability        Inventory Control &
      of equipment                 Order Processing
      to be used

 6.   Decide on appropriate        Exper. Design
      experimental                 Routines
      design, etc.

 7.   Interface the                Data Comm. Linkage,
      instrumentation,             Rand. and Data Base
      randomize the runs,          Management Systems
      run exp., & store            (DBMS)
      results in data base

 8.   Perform any number           DBMS
      of data verification
      procedures

 9.   Input meta data to           DBMS
      data base

10.   Complete data manipulation   DBMS or Stat. Library
      operations such as
      MV's, transformations,
      sorting, merging,
      etc.

11.   Perform appropriate          SL which should
      stat. analysis including     include many Stat.
      EDA. Note:                   Graphics (SG)
      Many steps involved
      here

12.   Prepare graphics             SG & Graphics Presentation
      for presentation             Package
      of results

13.   Prepare final                Ledger analysis
      accounting summary           package & spreadsheet
      of cost vs benefits          package

14.   Complete written             Word & text processor
      report on results

15.   Develop slides for           Graphics Presentation
      oral presentation
      of results

16.   Give oral report to          Real Time Oral "Dog
      management throughout        & Pony Show" on CRT
      corporation

17.   Save all results             File Organizer
      for future

18.   Begin new project            Electronic Mail &
      or read mail to              then go to step 1
      discover what the
      boss wants next

The personal workstation is rapidly emerging as a powerful tool for conducting data analysis, particularly in
contrast  to  either  the  large  mainframe  or  the  small  personal  computer.  This  talk  describes  some  user  experiences 
in  working  with  a  variety  of  workstations  and  in  providing  data  analysis  software  for  them,  especially  for 
graphical  display  of  data.  The  discussion  includes  the  present  state  and  desirable  future  evolution  of 
workstations  from  the  viewpoint  of  statistical  applications. 


1. A Computer for Every Data Analyst

Recent  trends  in  computer  technology  have  caused 
drastic  changes  in  the  price  of  hardware.  At 
present, a workstation computer may be purchased
for  approximately  $15,000,  and  the  trend  in  price  is 
still  definitely  downward;  soon  such  machines  will 
be  priced  near  $5,000.  With  prices  at  this  level,  it 
will  not  be  long  before  any  serious  data  analyst  will 
be  able  to  afford  a  personal  workstation. 

When  we  speak  of  a  workstation,  we  mean 
something  quite  different  from  the  type  of  machine 
currently  called  a  “personal  computer”.  The 
“personal  computer”  is  generally  characterized  by 
slow  processor  speed,  limited  internal  storage 
capacity,  and  small  amounts  of  external  storage. 
At  present,  most  of  these  machines  are  based  upon 
processors  with  either  8-bit  or  16-bit  architectures; 
this  naturally  limits  the  amount  of  memory  that  the 
machine  can  address.  The  current  personal 
machines  are  also  limited  in  software.  Although 
there  are  numerous  small  programs  for  these 
machines,  large,  integrated  systems  are  not 
normally  present.  Typically,  they  have  very 
primitive  operating  systems,  and  because  of 
hardware  constraints,  the  operator  is  often  forced  to 
load  and  unload  programs  manually  as  the  need 
arises. 

The  personal  workstation  is  very  different,  indeed. 
The  workstation  is  a  “real  computer”,  not  unlike 
the  super-mini  machines  designed  for  scientific 
processing.  A  workstation  typically  has  a  32-bit 


processor,  at  least  1  megabyte  of  main  memory, 
and  disk  storage  capacity  in  the  tens  of  megabytes. 
Also,  workstations  normally  have  modern  operating 
systems and sophisticated software. They are
capable  of  running  the  application  systems  and  user 
programs  written  for  mainframe  time-shared 
computers,  without  the  competition  for  resources 
and  the  usage  costs  associated  with  time-sharing. 
Workstation  users  do  not  suffer  due  to  inadequacies 
of  hardware  or  software.  In  fact,  the  workstation 
opens  new  opportunities  for  the  development  of  an 
environment  which  emphasizes  the  human 
interface. 

Of  course,  it  is  appropriate  to  ask  “Why  not  a 
time-shared  mainframe?”  The  answer  to  this  is 
that  the  workstation  gives  the  analyst  control  over 
the  computing  resources  necessary  for  the  job.  The 
price  of  a  mainframe  typically  means  that  it  is 
controlled  by  a  group  that  may  not  be  responsive  to 
the  need  for  modern  data  analytic  software.  Also, 
since  the  processor  is  shared,  work  for  other  users 
may  interfere  with  the  processing  power  needed  for 
data  analysis. 

The  Human  Interface 

There  are  several  characteristics  of  the  workstation 
which  have  a  major  impact  on  its  human  interface. 
These  are  high-resolution  display  devices,  provision 
for  user  control  of  multiple  processes,  interaction, 
and  networking.  The  display  is  often  a  bitmap 
raster-scan  device,  with  resolution  of  approximately 
100  pixels  per  inch.  This  relatively  high  resolution 


enables  the  display  to  produce  approximations  to 
typeset  documents,  with  various  fonts  and  point 
sizes.  It  also  can  produce  quite  satisfactory 
graphical  displays  and,  since  the  local  processor  can 
change  the  bitmap  rapidly,  the  display  can  give  the 
appearance  of  continuous  motion. 

One  of  the  major  uses  of  such  a  display  device 
involves  the  creation  of  “windows”  on  the  screen. 
Each  window  acts  as  an  independent  connection  to 
the  processor,  much  as  multiple  timesharing 
terminals  could  be  running  on  a  large,  shared 
machine.  Any  particular  window  can  run  a  process 
that  is  tailored  to  a  specific  task,  such  as  producing 
graphical  displays,  text  editing,  or  document 
display. 

The  user  has  control  over  the  workstation  not  only 
through  a  conventional  keyboard,  but  also  through 
a  visual  interface.  Dynamic  interaction  with  the 
display  is  carried  out  through  a  mouse  (or  touch 
screen,  tablet,  light  pen,  etc.)  that  enables  the  user 
to  point  at  the  display,  draw,  and  make  menu 
selections.  The  combination  of  pointing  device  and 
good,  fast  graphics  makes  menu-style  interfaces  to 
application  software  much  more  attractive. 

Because  certain  peripheral  computer  facilities  are 
expensive  or  infrequently  used,  workstations  use 
local  area  networks  to  provide  access  to  them.  At 
one  extreme,  workstations  can  be  used  as  very 
intelligent  terminals  to  current  mainframe 
computers. 

Statistical  Computing  in  the  New  Environment 

How  will  a  workstation  environment  affect 
statistical  computing?  Major  impacts  will  be  made 
by:

•  Local  Power 

•  Graphics 

•  Dynamic  Displays 

•  Multiple  Windows 

•  Interaction 

•  Networking 

2.  Local  Power 

The  availability  of  large  amounts  of  essentially  free 
computing  power  is  likely  to  change  the  way  that 
data  analysis  is  done.  Once  a  workstation  is 
available,  it  costs  nothing  to  have  it  computing. 
Therefore  more  processor  intensive  data  analytic 
techniques  are  likely  to  be  attempted.  Techniques 


like  the  bootstrap,  in  which  thousands  of  similar 
analyses  may  be  carried  out  in  order  to  find 
confidence  limits,  are  likely  to  be  more  common.  In 
general,  simulation  techniques  are  much  more  likely 
to  be  used. 
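
As a minimal sketch of the kind of computation meant here, the following finds bootstrap confidence limits by the simple percentile method. The function name `bootstrap_ci`, its default of 2000 resamples, and the sample values are illustrative assumptions, not part of the original discussion.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000,
                 alpha=0.05, seed=None):
    """Percentile bootstrap confidence limits for `stat` of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Repeat the same analysis on many resamples drawn with replacement.
    estimates = sorted(
        stat([rng.choice(data) for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Read the limits off the empirical distribution of the estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.1]
    print(bootstrap_ci(sample, seed=1))
```

Each call repeats the chosen statistic thousands of times, which is exactly the sort of processor-intensive work that costs nothing extra on a dedicated workstation.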

However,  perhaps  the  more  important  contribution 
of  the  local  processing  power,  is  that  it  will 
encourage  the  analyst  to  consider  more  analyses. 
Data  analysis  is  a  process  that  does  not  have  one 
fixed  answer;  often  it  is  important  to  come  up  with 
several  different  views  of  the  same  data.  When  the 
analyst  is  able  to  look  at  the  data  from  many 
different  viewpoints,  without  having  to  incur  large 
processing  costs  or  to  share  his  machine  with  others 
(and  suffer  degraded  performance),  the  quality  of 
the  analysis  is  likely  to  improve. 

3.  Graphics 

No  longer  will  computer  graphics  be  limited  to 
those able to afford expensive equipment. With the
advent  of  workstations,  every  user  will  have  graphic 
capabilities.  Graphical  techniques  have  long  been 
known  as  powerful  aids  to  data  analysis.  The 
human  mind  is  far  superior  to  any  computer 
software  in  the  area  of  pattern  recognition.  When 
shown  a  scatter  plot,  a  human  data  analyst  can 
recognize  curvature,  clustering,  and  a  host  of  other 
interesting  characteristics  of  the  data  displayed. 
The  combination  of  interactive  graphic  displays 
with  an  interactive  computing  environment  will 
provide  a  synergistic  effect,  leading  again  to  better 
data  analysis. 

Perhaps a less obvious benefit of graphics will be
the ability to use graphical symbols to aid user
interaction.  Just  as  international  road  signs  use 
pictures  to  guide  automobile  drivers,  so  will 
computers  be  able  to  use  non-verbal  graphic 
images,  known  as  icons,  to  guide  data  analysts. 

4.  Dynamic  Displays 

Static  graphical  displays  have  always  been  available 
to  people  who  want  to  look  at  data.  Many  of  the 
displays  common  today  were  invented  in  past 
centuries.  Much  of  the  research  into  new  methods 
of  displaying  data  involves  dynamically  changing 
pictures.  This  can  involve,  for  example,  movie-like 
sequences of views of a point-cloud. Good
examples of such research are PRIM9
(Fisherkeller, 1974), ORION (Friedman, 1982),
and PRIMH (Donoho, 1982).

At AT&T Bell Laboratories, we have experimented
with a number of these dynamic displays. Most of
this research was done on a Teletype 5620 Dot-Mapped
Display terminal (which is basically a
diskless workstation). We have rotating point
clouds, a straightedge display that moves under
control of the mouse, dynamic display of identifiers
on a scatterplot, and a more advanced technique for
multivariate data, known as "brushing" (see Becker
and Cleveland, 1984).

Dynamic  displays  can  also  be  used  for  the 
presentation  of  several  distinct  but  related  pictures 
in  alternation,  the  process  called  alternagraphics  by 
Tukey  (1982).  Given  a  multi-plane  graphics  color 
display  terminal  with  a  color  map  (such  as  the 
Advanced  Electronics  Design  Model  512),  it  is 
possible  to  rapidly  cycle  through  pre-computed 
scenes.  Such  displays  are  not  slowed  down  by  their 
complexity,  but  have  a  limited  number  of  views  and 
precomputation  overhead.  We  have  used  this 
technique  to  show  rotation  of  3-dimensional 
surfaces  with  perspective  generated  by  stereo 
glasses.  We  have  also  looked  at  the  behavior  of 
smoothers  as  a  locality  parameter  was  varied. 

Another  use  of  local  processing  power  in 
conjunction with dynamic displays is in fast-changing
displays. For example, it should be
possible  to  plot  the  data  in  a  univariate  regression 
problem  and  to  interactively  move,  delete,  or  add 
points  to  the  plot  and  to  see  the  regression  line 
continuously  updated.  We  could  also  choose  power 
transformations  for  the  x-  and  y-  variables  on  a 
scatter  plot  by  observing  the  picture  as  the 
transformation  powers  were  changed  under  control 
of  a  graphical  input  device. 
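
One way such a continuously updated regression line could be kept cheap is to hold the fit as running sums, so that adding or deleting a point costs a handful of arithmetic operations. The sketch below is only an illustration of that idea; the class name `RunningLine` and the sample points are hypothetical and are not taken from any system described here.

```python
class RunningLine:
    """Least-squares line y = intercept + slope * x, kept as running sums."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def add(self, x, y, sign=1):
        # sign=+1 adds a point, sign=-1 takes it back out.
        self.n += sign
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.sxy += sign * x * y

    def remove(self, x, y):
        self.add(x, y, sign=-1)

    def coefficients(self):
        # Closed-form normal-equation solution; needs two distinct x values.
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


line = RunningLine()
for x, y in [(0, 1), (1, 3), (2, 5), (3, 0)]:  # last point is an outlier
    line.add(x, y)
before = line.coefficients()
line.remove(3, 0)  # the analyst deletes the outlier on the plot
after = line.coefficients()
```

Moving a point is a removal followed by an addition, so the line can be redrawn after every mouse movement.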

5.  Multiple  Windows 

Since  workstations  allow  the  user  to  control 
separate  activities  from  separate  windows,  a 
number  of  difficulties  of  current  statistical  software 
melt  away.  For  example,  it  becomes  easy  to  allow 
the  user  to  interact  with  the  statistical  software  in 
one  window  (either  through  a  keyboard  or  menus), 
to  see  graphical  results  in  another  window,  and  to 
get  on-line  assistance  at  the  same  time  in  another 
window.  The  size,  shape,  and  position  of  the 
windows can reflect the user's wishes, and they can
be  rearranged  at  any  time. 

6.  Interaction 

Since  workstations  normally  provide  hardware  and 
software  facilities  for  user  interaction,  there  is 
much  flexibility  in  the  face  that  statistical  software 
presents to the user. Dynamically changing menus
can be provided; menus can “pop-up” on the
display  until  the  user  makes  a  selection,  and  then 
can  disappear;  icons  can  be  used  for  non-verbal 
interaction.  Multiple  windows  allow  users  to 
explore  on-line  documentation  or  pursue  any  other 
background  computing  they  like,  without 
interrupting  or  removing  from  the  display  the 
current  interaction. 

7.  Networking 

Networking  is  one  method  for  providing  a  number 
of  workstations  with  shared  resources.  However, 
networking  facilities  will  do  much  more  for  the 
statistician. Workstation networks are often
configured as in Exhibit 1.

[Exhibit 1: diagram of a workstation network configuration; the only label that survives is "Local Disks".]


Data  transmissions  around  the  local  network  are 
typically  very  fast,  often  several  megabits  per 
second.  At  these  speeds,  users  can  share  data, 
documentation  and  software  without  experiencing 
any  loss  in  apparent  performance  because  a 
particular  file  is  actually  at  a  remote  location.  As 
the  figure  suggests,  relatively  expensive  and 
infrequently  used  resources  (fast  printers,  hardcopy 
plotters,  very  large  disks,  special  processors)  can  be 
connected  to  the  network  and  used  by  all  the 
workstations,  without  seriously  slowing  down  access 
to  them.  The  ability  to  connect  the  local  network 
to  other  networks  and  to  other  types  of  computing 
environment  is  particularly  important.  Users  need 
not sever their links to the conventional mainframe
computing world, from which much of the data for
analysis  will  continue  to  come. 

The  personal  effects  of  the  networking  environment 
are  at  least  as  important  as  the  technical  effects. 
Electronic  communication,  both  local  and  remote,  is 
one  of  the  most  fundamental  changes  being  made 
by  the  current  computer  revolution.  Workstations 
linked  by  local  and  remote  networks  give  the  user 
full  access  to  this  communication.  For  example, 
the UNIX¹ system provides both one-to-one
communication  (e.g.,  mail  and  write  commands) 
and  broadcast  communication  (news  and  netnews). 
The  style  of  communication  stimulated  by  these 
facilities  is  qualitatively  different  from  traditional 
paper  communication,  emphasizing  rapid  response 
and  brief  documents.  In  many  ways  it  is  more 
communication  in  contrast  to  the  publication  mode 
typical  of  paper  documents.  The  publication 
process  itself  is  also  mightily  changed  in  the 
workstation  environment:  the  hardware  and  user 
interface  facilities  allow  authors  to  interact  with  the 
editing,  design  and  production  process  much  more 
directly. 

Modern  Software  for  Data  Analysis 

Once  workstation  hardware  is  available,  it  becomes 
necessary  to  think  of  appropriate  software  for  the 
new  environment.  Of  course,  it  will  not  be 
sufficient  to  use  old  batch  software  inherited  from 
the  1960s,  or  to  think  of  the  display  screen  as  a  fast 
line  printer.  In  addition,  it  must  be  remembered 
that  hardware  evolution  will  continue,  and  hence 
the  software  must  be  adaptable  to  tomorrow’s 
hardware. 

Most  people  think  of  statistical  computations  such 
as  regression  or  transformations  when  thinking  of 
statistical  computing.  However,  there  is  much 
more  involved  than  that.  The  software  must  be 
able  to  store  and  retrieve  data,  work  with  a  wide 
variety  of  data  structures,  and  provide  interactive 
graphics  on  various  graphical  devices  (which,  like 
workstations,  have  proliferated  rapidly  and  are 
continually  undergoing  change). 

S  is  a  system  which  is  meant  to  fulfill  the  needs  for 
modern  data  analysis  software.  It  runs  under  a 
number  of  versions  of  the  UNIX  operating  system 
on  a  variety  of  hardware.  S  is  described  in  a  recent 
book  by  Becker  and  Chambers  (1984). 

1.  UNIX  is  a  Trademark  of  AT&T  Bell  Laboratories. 


The  primary  goal  for  S  is  to  allow  users  to  perform 
good  data  analysis.  Judging  from  the  experience  of 
some  thousands  of  users,  S  satisfies  this  goal  quite 
well.  However,  in  order  for  people  to  be  able  to  use 
S  to  analyze  data,  they  must  have  access  to  S. 
Hence  it  is  desirable  to  have  S  readily  available,  on 
inexpensive  but  appropriate  hardware. 

Luckily,  the  general  trend  in  computer  hardware  is 
for  more  power  at  less  cost,  and  the  current 
selection  of  professional  workstations  is  the 
manifestation  of  this  trend.  Modern  workstations 
combine  computing  power,  large  amounts  of 
addressable  memory,  and  quick  and  consistent 
response  time,  and  often  come  with  the  UNIX 
operating  system.  Many  of  these  workstations  also 
have  provision  for  bitmap  graphic  displays.  These 
machines  not  only  provide  an  excellent  environment 
for  S,  but  they  also  have  the  potential  for  providing 
better  understanding  of  data  through  dynamic 
graphic  displays.  These  new  UNIX-based 
workstations  are  a  desirable  environment  for  S 
because  of  their  low  price,  good  graphics  (bitmap, 
dynamic), and responsiveness. We have now
experimented with S on the following workstations:
Sun,  Hewlett-Packard  9000,  AT&T  3B2,  and 
Wicat.  The  machines  run  a  variety  of  UNIX 
systems,  including  AT&T  System  V  and  Berkeley 
4.2BSD. 

We  have  had  experience  in  porting  S  to  the 
following  machines  and  variants  of  the  UNIX 
system: 

Hardware                          Operating System

HP Series 200 (MC68000)           HP-UX (System III)
SUN 100 (MC68010)                 4.2BSD
3B2-300 (WE32000)                 System V
Wicat 150 (MC68000)               7th Edition, System V
Perkin-Elmer 32/30                7th Edition
HP Series 500 (HP 32-bit chip)    HP-UX (System III)
Apollo                            AEGIS
IBM 370                           UNIX/370
DEC VAX 11/780                    32V
DEC PDP 11/70, 11/45              7th Edition
Pyramid                           System V, 4.2BSD
Ridge                             4.2BSD
DEC VAX 11/780, 750               BSD 4.1, 4.1c, 4.2
DEC VAX 11/780                    System III, V
DEC VAX 11/780, 750               8th Edition
AT&T 3B20S                        System V

When  we  first  wrote  S  for  the  UNIX  system,  one  of 
the  major  decisions  we  made  was  the  basic  choice 
of  programming  language.  Because  of  the  large 
amount  of  FORTRAN  computational  code  already 
available,  we  decided  to  use  that  language. 
However,  we  decided  that  the  primitive  operations 
of  the  S  system  should  be  implemented  in  C.  C 
provides  the  natural  linkage  with  the  underlying 
UNIX  operating  system  calls. 

Conclusions 

The  statistical  computing  arena  is  undergoing  a 
quiet  revolution.  In  the  near  future,  increased 
computing  power,  good  graphics  and  new  modes  of 
human  interaction  will  be  available  to  a  greatly 
increased  population  of  potential  users  of  statistical 
systems. Such users will benefit from, and indeed
require, high-quality on-line help in using statistical
software.  Fortunately,  the  personal  workstation  is 
well  suited  to  provide  such  help.  Its  resources  are 
essentially  free  to  the  user,  encouraging  the 
approach  that  as  much  effort  as  needed  should  be 
spent  by  the  computer  in  presenting  data 
dynamically  and  in  supporting  interaction  with  the 
user. 

The  statistician  will  also  find  many  new 
opportunities  in  such  an  environment.  The 
computer  power  should  greatly  increase  the  use  of 
simulation  as  a  routine  tool,  whenever  the  behavior 
of  a  model  or  estimate  needs  to  be  studied.  In  the 
choice  of  theoretical  work  in  statistics,  as  well,  the 
statistician  with  a  real  concern  for  the  healthy 
practice  of  data  analysis  will  find  new  challenges  in 
providing  support  for  this  new  user  population.  For 
example,  graphical  presentation  of  data,  diagnostics 
of  value  to  the  non-professional  analyst  and  more 
advanced  techniques  such  as  expert  systems  are  all 
exciting possibilities in the new environment.

Multidimensional scaling is an often used technique with many similarities to
factor analysis. This paper discusses and compares several models for multidimensional
scaling, and gives some generalizations of some of these models. It proposes new (to
multidimensional scaling) fitting criteria, and compares the results obtained by their
use. Some solutions to problems encountered in the optimization algorithms are
discussed. Finally, some statistical implications of multidimensional scaling models
are given.


1.  Introduction 


In the general multidimensional scaling (MDS)
problem the data consist of one or more dissimilarity
matrices, where a dissimilarity is
some measure of distance, and the matrices give,
in some sense, the distances between the objects
(or stimuli) considered. An easy example of
such a matrix is the table of mileage distances between
cities often found on road maps. Here, the
distance between cities is the dissimilarity
measure, and the MDS problem is to locate the
cities in a two (or three) dimensional space
based upon these distances. As a second,
more complicated example, consider the purely
fictitious data in Table 1. In this example, the
stimuli represent 7 stores and the dissimilarity
is a ranking of the distances in each row of the
distance matrix. Rather than using the actual
distances, the ranks of the distances are used
as the dissimilarity measure. From row 1 of the
table, one can see that store 3 is closest to
store 1, store 2 is second closest to store 1,
store 5 is third closest, etc.


The data is an example of ordinal dissimilarity
data. In the general MDS problem, the data can
be categorical, ordinal, interval, or ratio.
Also, since the ranks in one row have no direct
relationship to the ranks in a second row, each
row represents a different stratum (or conditionality
group) in the sense that dissimilarities
in the two rows cannot be compared. In the
general MDS problem, more than one dissimilarity
matrix can be observed, and a stratum can be
a row of a dissimilarity matrix, an entire
dissimilarity matrix, or all of the data.
These strata correspond to what is called row
conditional, matrix conditional, or unconditional
data, respectively.


The idea in multidimensional scaling is to locate
objects in a $\tau$-dimensional Euclidean space in
such a manner that the agreement between the
observed dissimilarities and the distances
predicted by the location of the objects in the
space is in some sense optimal. In this example,
the usual Euclidean distance given by

$$\delta_{ij} = \left[ \sum_{k=1}^{\tau} (X_{ik} - X_{jk})^2 \right]^{1/2}$$

is used, where $X_{ik}$ is the coordinate of the $i$-th
object in the $k$-th of $\tau$ dimensions in the
Euclidean space. (The matrix consisting of the
$X_{ik}$ is called the configuration matrix.)

Table 1: Store Distances and Store Ranks

Generally, a criterion function of some form is
used to obtain an optimal solution. In this
case, the criterion function is given by:

where $n$ is the number of stimuli, $\delta^*$ denotes
the optimal dissimilarities, called disparities
(see below), and $\delta$ is the predicted
distance given above.

Table 2

The criterion function is optimized with respect
to both the configuration, through $\delta$, and the
disparities $\delta^*$. If the data is ratio or interval,
the disparities $\delta^*$ are the observed distances
and there is no optimization with respect to $\delta^*$.
In ordinal data the disparities are the predicted
dissimilarities, where the prediction is made
via a monotonic regression of the ranks of the
observed data on the predicted distances $\delta$
within each stratum. Finally, in categorical
data, the disparities $\delta^*$ are the average of all
predicted distances $\delta$ which have the same observed
dissimilarity within a stratum.

The numerator in the above expression is the
least squares criterion. The denominator is a
normalizing factor which prevents the solution
from becoming degenerate in ordinal (or categorical)
data (a different criterion might be
used in ratio and interval data). The denominator
is required here because the optimization is
with respect to both $\delta$ and $\delta^*$. If the denominator
were not present, $q$ could be made as small as
desired simply by simultaneously scaling both $\delta$
and $\delta^*$.

As an example of the monotonic regression used
in ordinal data, consider the following table:

              Ranks for Store 7

Store         5  (remaining entries not recoverable)
Rank          1      2      3      4      5      6
$\delta$    .69    .65    .44   1.10   1.07   1.23
$\delta^*$  .59    .59    .59   1.08   1.08   1.23

In this table the original rankings of the
dissimilarities for each store compared to
store 7 are given in the second row.
(The data is presented in its original rank
order.) Using the estimated configuration, the
distances in the third row are computed. In
computing the disparities, note that in the
third row .65 is less than .69, .44 is less than
.65, and 1.07 is less than 1.10. Since the
disparities must be in the same rank order as
the observed dissimilarities in the second row,
the monotonic regression averages the elements in
the third row as required in order to preserve the
originally observed ordering in the disparities
$\delta^*$ given in the fourth row.
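
The averaging just described can be sketched with the standard pool-adjacent-violators scheme for least squares monotone regression. This is a minimal illustration, not the MSCAL code: the function name `monotone_regression` is hypothetical, and the input is the six third-row distances of the example (.69, .65, .44, 1.10, 1.07, 1.23).

```python
def monotone_regression(values):
    """Least-squares nondecreasing fit by pooling adjacent violators."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the new one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every member of a block gets the block mean
    return fitted


# Distances in rank order, as in the third row of the example table.
distances = [0.69, 0.65, 0.44, 1.10, 1.07, 1.23]
disparities = monotone_regression(distances)
```

The first three distances are pooled to their mean (about .593, printed as .59), the next two to 1.085 (printed as 1.08 in the table), and the last value is left unchanged at 1.23.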

When the criterion function is optimized, the
resulting configuration is given in Table 2. A
plot of these results is given in Figure 1, with
a plot of the store locations which gave rise to
the rankings in Table 1 presented in Figure 2.
In comparing these figures note that the scale
is meaningless since no distances were actually
observed. The differences in the location of
the stores in these two figures come about
because of the lack of uniqueness of the estimated
configuration in ordinal (or categorical)
data. (For that matter, note that the store
locations given in Figure 1 are not unique, as
an infinite number of such plots could have
given rise to the same rankings in Table 1, even
after eliminating variation due to reflections
and rotation.)

The Configuration X


Figure 1: The Estimated Configuration

The fact that the estimated configuration in
Figure 1 is not unique (even after allowing for
changes in sign and for rotations) can be seen
as follows: If the fit is perfect, then the
numerator is 0.0 and the denominator has no
effect. One can then change the configuration
in such a manner that the rank ordering
of the distances is unchanged. The monotonic
regression will then change the disparities so
that they are exactly equal to the new distances,
and the numerator in the criterion function
remains 0.0.


Figure 2: The Actual Store Locations


In section 2 the general criterion function
(and thus, the model) used by subroutine MSCAL
in the IMSL library is described, along with
possible generalizations. (Subroutine MSCAL
will be released in the next edition of the
libraries.) Section 3 describes the methods
used for fitting the model, while section 4
gives a more complicated example.


2.  The General Criterion Function


The general criterion function in subroutine
MSCAL is given as follows:

$$q = \sum_{h} \omega_h \sum_{i,j,m} \left| f(\delta^*_{ijm}) - a_h - b_h f(\delta_{ijm}) \right|^p$$

where $\omega_h$ depends upon the $f(\delta^*_{ijm})$ in the $h$-th
stratum, $h$ indexes the strata, $f$ is a transformation
discussed below, $a_h$ and $b_h$ are constants
to be estimated in some models, $m$ indexes the
subjects and will depend upon $i$, $j$, and $h$
according to the stratification used, and $p$
allows for $L_p$ estimates other than least squares
to be used in the criterion function. (The most
likely values for $p$ are 2.0 for least squares
and 1.0 for least absolute value.)

Null and Sarle (1982) also suggested a criterion
function involving p-th power estimates for use
with ratio and interval data. MSCAL allows
categorical and ordinal data as well as ratio and
interval.

The function $f$ in the criterion function allows
the user to make various assumptions about the
distribution of the observed dissimilarities.
This is clearly most important in ratio or
interval data, but it also has effects in
ordinal and categorical data, primarily through
the weights $\omega_h$. Since least squares and maximum
likelihood estimates are equivalent (in one
stratum) when the distribution of the transformed
random variables is normal, the function $f$ may
be used as a transformation to normality. This
is equivalent to using $f$ to obtain homogeneous
variances within each stratum.

Choices for $f$ in MSCAL are:

$$f(x) = \begin{cases} x^2 \\ x \\ \log(x). \end{cases}$$

If one believes that squared distances have
constant variance (and are approximately normally
distributed), then $f(x) = x^2$ should be used.
Similarly, $f(x) = x$ or $f(x) = \log(x)$ should be
used if these transformations yield constant
variance.

The squared transformation is the transformation
used in the ALSCAL program of Takane, Young, and
DeLeeuw (1977), while distances are used in
MULTISCALE (Ramsay (1983)) and KYST (Kruskal,
Young, and Seery (1973)), among others, and log
distances are allowed in MULTISCALE.

The Distance Models

The models for the distances are equivalent
to those used in ALSCAL. They are given as
follows:

The Euclidean model:

$$\delta^2_{ijm} = \sum_{k=1}^{\tau} (X_{ik} - X_{jk})^2$$

The individual differences model:

$$\delta^2_{ijm} = \sum_{k=1}^{\tau} W_{mk} (X_{ik} - X_{jk})^2$$

where $W_{mk}$ is the weight on the $k$-th dimension for
the $m$-th subject.

The stimulus weighted model:

$$\delta^2_{ijm} = \sum_{k=1}^{\tau} S_{ik} (X_{ik} - X_{jk})^2$$

where $S_{ik}$ is the weight on the $k$-th dimension for
the $i$-th stimulus.

The stimulus weighted individual differences
model:

$$\delta^2_{ijm} = \sum_{k=1}^{\tau} W_{mk} S_{ik} (X_{ik} - X_{jk})^2$$

Other distance models are possible. For example,
instead of a weighted space, one could allow a
rotation of each individual's coordinate axes.
This yields the IDIOSCALE model of Carroll and
Chang (1970). Additionally, one could allow for
asymmetric models via the skew symmetric matrices
of Weeks and Bentler (1982). Future versions
of the MSCAL subroutine may allow for such
refinements.
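
The four distance models differ only in which weights multiply each squared coordinate difference, so one small routine can cover all of them. The sketch below assumes the squared-distance form used in ALSCAL, with a subject weight and a stimulus weight for each dimension; the function name `squared_distance` and its argument names are hypothetical and are not part of MSCAL.

```python
def squared_distance(x_i, x_j, subject_w=None, stimulus_w=None):
    """Squared model distance between stimuli i and j for one subject.

    x_i, x_j   -- coordinates of the two stimuli, one entry per dimension
    subject_w  -- weights for the subject on each dimension, or None
    stimulus_w -- weights for stimulus i on each dimension, or None
    """
    total = 0.0
    for k in range(len(x_i)):
        weight = 1.0
        if subject_w is not None:
            weight *= subject_w[k]
        if stimulus_w is not None:
            weight *= stimulus_w[k]
        total += weight * (x_i[k] - x_j[k]) ** 2
    return total


x1, x2 = [0.0, 0.0], [3.0, 4.0]
euclidean = squared_distance(x1, x2)                        # no weights
individual = squared_distance(x1, x2, subject_w=[1.0, 0.0])  # subject ignores dim 2
combined = squared_distance(x1, x2, subject_w=[2.0, 1.0], stimulus_w=[0.5, 1.0])
```

Leaving both weight arguments out gives the Euclidean model, supplying one gives the individual differences or stimulus weighted model, and supplying both gives the combined model.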

The Strata Weights

In metric scaling, strata weights are used to
weight the observations within a stratum. In
this case, weights which are inversely proportional
to the variances are preferred because
such weights lead to normal distribution theory
maximum likelihood estimates. Thus, in metric
scaling, one would use

$$\omega_h = \left[\, \cdots \,\right]^{-1}$$

where $\omega_h$ is the weight in the $h$-th stratum, the
sum is over all observations in the stratum, and
$n_h$ is the number of observations in the stratum.

In nonmetric scaling, because the criterion
function is minimized with respect to both $\delta$ and
$\delta^*$, the criterion function is degenerate unless
strata weights are used as a normalizing factor.
An optimum criterion value of zero could always
be obtained without this normalization. In most
multidimensional scaling programs, normalization
is provided by the use of one of two possible
weights proposed by Kruskal (1969). These
weights are given by:

$$\omega_h = \left[ \sum \left| f(\delta^*_{ijm}) \right|^p \right]^{-1}$$

and

$$\omega_h = \left[ \sum \left| f(\delta^*_{ijm}) - f(\delta^*_{\cdot\cdot}) \right|^p \right]^{-1}$$

where the sum is over the observations in the
$h$-th stratum, and where $f(\delta^*_{\cdot\cdot})$ is the average
of the disparities in this stratum.


3.  Fitting the Model


Initial estimates of all parameters are obtained
via the same methods which are employed in the
ALSCAL program of Takane, Young, and DeLeeuw
(1977). For the configuration this amounts to
obtaining the average of the product moments
matrices (double centering the dissimilarities),
computing the eigenvectors for the $\tau$ largest
eigenvalues of this matrix, and multiplying by the
square root of the matrix of eigenvalues. When subject
weights are required, the method of Schonemann
as modified by Young, Takane, and Lewyckyj
(1978) is used. Finally, when stimulus weights
are required, a multiple regression method in
conjunction with the method of Schonemann
is employed.

After the initial estimates are obtained, a
modified Gauss-Newton algorithm is used to obtain
estimates of most parameters. In the multidimensional
scaling models discussed here, this
amounts to iteratively reweighted least squares.
To speed convergence, the initial iterations are
performed on subsets of the parameters, while
in the final iterations all parameters but the
disparities are optimized simultaneously.
In all iterations, optimal values for the
disparities are computed via a secant based
method discussed later.

Not all parameters appearing in the general criterion
function have to be used in the multidimensional
scaling. Thus, with some exceptions,
the presence of the subject weights W, the
stimulus weights S, the scaling factor $b_h$, and
the additive constant $a_h$ is optional. Moreover,
any parameter matrix (including the configuration
matrix X) can be fixed in the optimization
procedure. (The disparities are fixed by
declaring the data to be interval or ratio data.)

The initial iterations proceed as follows:

1. In nonmetric scaling, the disparity
estimates $\delta^*$ are computed within each stratum
assuming that all other parameters are fixed.
The estimates of $a_h$ and $b_h$ within each stratum
are also computed at this time.

2. The optimal configuration estimates (X)
are computed.

3. The optimal subject weights estimates
(W) are computed (one subject at a time).

4. The optimal stimulus weights (S) are
computed.

When the maximum change in any parameter is less
than a user specified constant (100.0×EPS), the
iterative method changes. In the iterations at
this point, steps 2, 3, and 4 above are combined
so that optimal estimates of X, W, and S are
obtained simultaneously. (Note that in metric
scaling, the Hessian for all parameters is
computed. The inverse of this matrix is commonly
used as an estimate of the variance-covariance
matrix of the parameters. Some additional uses
of this matrix are discussed later.)


within  each  stratum  is  transformed  to: 


Convergence  is  said  to  have  occurred  when  the 
change  in  any  parameter  from  one  Iteration  to 
the  next  is  less  than  a  user  specified  constant  «  .  f  _  h.rfx  i|P  -  wVI«  |P  - 

EPS.  ^  ^ 


The  Lp  Gauss-Newton  Algorithm 

As  stated  earlier,  a  modified  Gauss-Newton 
algorithm  is  used  in  the  estimation  of  all 
parameters  but  the  disparities  (and  the  parameters 
and  bj,).  This  algorithm,  discussed  by  Merle 
and  Spath  (197*4),  uses  iteratively  rewelghted 
least  squares  on  the  criterion  function.  In 
discussing  this  algorithm,  first  rewrite  the 
criterion  function  as  follows: 
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Least  squares  Is  then  used  on  a  linearization  of 
the  parameters  in  to  obtain  the  estimates. 
In  this  least  squares  estimation.  It  Is  assumed 
that  (1)^  and  the  denominator  of  q  are  fixed. 
(I.e.,  for  each  observation,  uif,  and  the  denom¬ 
inator  of  q  are  combined  to  yield  an  observation 
weight  which  la  fixed  with  respect  to  the 
iteration.)  The  only  problem  occurs  when  p<2 
and  the  denominator  Is  zero,  at  which  time  a 
division  by  zero  would  occur.  In  this  situation, 
the  denominator  Is  set  to  0.001,  and  the  calcul¬ 
ations  then  proceed  as  usual. 

Estimating  the  Disparities 

I.  Ordinal  data 

As  was  discussed  earlier.  In  least  squares  MDS 
monotonic  regression  Is  used  In  the  computation 
of  the  disparities  In  ordinal  data.  Because  p-th 
power  estimates  are  computed,  these  methods 
cannot  be  used  (when  p  Is  not  2.0),  because  they 
would  not  yield  optimal  estimates.  A  severely 
modified  monotonlc  regression  must  be  used 
Instead.  Within  each  stratum  the  criterion 
function  Is  given  by: 


where  b  is  the  scaling  parameter  bj,  for  this 
stratum  (a^,  la  not  used),  and  X  is  the  Lagrange 
multiplier. 

Within  each  stratum  the  criterion  function 
Involves  parameters  y()<),  b,  and  X  in  this  phase 
of  the  optimization.  Because  of  the  monotonlc 
restrictions  on  the  y(|<).  It  is  not  easily 

possible  to  use  the  usual  Newton-Raphson  tech¬ 
niques  on  all  parameters  simultaneously. 
Because  of  X  and  the  second  terra  In  the  criterion 
function,  simple  modification  of  the  usual 
monotone  regression  techniques  may  not  be 
employed.  The  following  algorithm,  while 

sometimes  slow  to  converge,  seems  to  yield  the 
optimal  estimates  (Kuhn-Tucker  theory  guarantees 
that  the  estimates  are  optimal  if  convergence 
occurs. ) : 

1 .  Set  X  and  b  to  0.0. 

2.  Estimate  yji^j  for  the  criterion  q  holding 
X  and  b  fixed. 

3.  Estimate  b. 

A.  Estimate  X. 

5.  If  the  change  In  any  parameter  from  one 
iteration  to  the  next  Is  greater  than 
EPS,  go  back  to  step  2. 

In  step  2  a  secant  algorithm  is  used  to  compute 
each  isotonic  parameter  y(|<)  based  upon  the 
observations  f(6(|<))  associated  with  the  param¬ 
eter.  The  y(|<)  are  made  monotone  by  restricting 
all  y’s  which  would  otherwise  violate  the 
monotonlc  restriction  to  be  equal.  This  has  the 
effect  of  Increasing  the  number  of  observations 
which  are  used  in  the  Lp  location  estimate  of 
the  restricted  parameters.  For  example,  the 
monotonlolty  restrictions  may  require  that  the 
rankeu  transformed  disparities  y(2)  through  y{y) 
be  equal.  One  would  then  compute  the  estimate 
of  these  6  parameters  as  the  Lp  location  estimate 
of  the  transformed  distances  f(S(2))  through 

f(«(7)). 


$$q = \frac{\sum_{k} |y_{(k)} - f(\delta_{(k)})|^{p}}{\sum_{k} |y_{(k)}|^{p}}$$

where $y_{(k)}$ is the disparity for the observation with the k-th rank of the
observed dissimilarity in its stratum. (k is
enclosed in parentheses to emphasize this
ranking.) In this equation, the $y_{(k)}$ are
parameters, while in this phase of the optimization
it is assumed that the $\delta_{(k)}$ are fixed.
The monotonic assumption in ordinal data requires
that $y_{(1)} \le y_{(2)} \le y_{(3)} \le \cdots \le y_{(s)}$, where s
is the number of observations in the stratum.

Using  Lagrange  multipliers  the  criterion  function 


In step 3, a secant algorithm for fixed $y_{(k)}$ and
$\lambda$ is used. The computation of $\lambda$ in step 4 is
direct, and is obtained by setting the derivative
of q with respect to each $y_{(k)}$ equal to zero, and
then summing over all possible $y_{(k)}$.

The algorithm seems to converge for all values of
p in the interval [1,2]. Convergence is slowest
for p near 1 and is fastest at p = 2.

II.  Categorical  data 

In  categorical  data  the  p-th  power  estimate  of 
location  is  used  on  the  transformed  distances  as 
the  disparity  estimate  for  all  observations  with 
the  same  observed  dissimilarity  within  each 
stratum. A secant algorithm is used to compute
the p-th power location estimate.


Table  3 


Because the Hessian is computed in full in metric
scaling, a case analysis (also called a residual
analysis) can be performed. Clearly, one
quantity of interest for each observed dissimilarity
is its residual. Some measure of influence
may also be of interest, as well as the observation
weight and its standardized residual. These
statistics may be computed as follows:


1. Compute the observation weight and
residual in the usual manner.

2. Compute the influences as follows: Let
$g_{ijm}$ denote the row vector of weighted partial
derivatives of $a_m + b_m f(\delta_{ijm})$ with respect
to the parameters $a_m$, $b_m$, and all parameters in
$\delta_{ijm}$, and let G denote the matrix of these
partial derivatives. Compute the influences (or
leverages) as the diagonal elements of the matrix
$G(G'G)^{-1}G'$.

3. The studentized residual is given as:

r = e / SQRT(MSE * (1 - h))

where e is the residual, h is the leverage, and
MSE is the (weighted) mean square error estimate
computed via the criterion function and adjusted
for the number of parameters estimated.
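The three steps above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the matrix G is taken as given (in the scaling problem its rows would be the weighted partial derivatives described in step 2), and the MSE is assumed here to be the residual sum of squares divided by the number of observations minus the number of parameters, which may differ from the weighted estimate obtained from the criterion function.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def leverages(g):
    """Diagonal elements of G (G'G)^{-1} G' for a full-rank matrix G."""
    q = len(g[0])
    gtg = [[sum(row[i] * row[j] for row in g) for j in range(q)]
           for i in range(q)]
    return [sum(gi * zi for gi, zi in zip(row, solve(gtg, row))) for row in g]


def studentized(residuals, lev, n_params):
    """r = e / sqrt(MSE * (1 - h)), with MSE adjusted for n_params (assumed form)."""
    mse = sum(e * e for e in residuals) / (len(residuals) - n_params)
    return [e / math.sqrt(mse * (1.0 - h)) for e, h in zip(residuals, lev)]


# Hypothetical derivative matrix (an intercept column and one slope column).
G = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
h = leverages(G)
r = studentized([0.1, -0.2, 0.2, -0.1], h, n_params=2)
```

The leverages sum to the number of columns of G, which gives a quick check on the computation.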


Wine Tasting Data

When a multidimensional scaling analysis of the
data is performed using least squares as the
criterion function, and the individual differences
model for the distances, the resulting criterion
function is given as follows for each of 1,
2, 3, and 4 dimensions.

T    Criterion
1    12.599
2    *1.857
3    3.1125
4    1.287


4. An Example


As a second example of ordinal row conditional
dissimilarity data, consider the matrix in Table
3 in which nine wines are judged with respect to
their dissimilarity by one of nine people. The
data consist of nine such matrices, one for
each of the nine judges. Each person ranked the
dissimilarity of the remaining eight wines with
the row wine. Thus the ij element in Table 3
gives the ranked dissimilarity of the j-th to
the i-th wine, where ranking is within each
row. Thus, in row 1, wine 2 is judged most
similar to wine 1 in this table, while wine 8 is
judged least similar. The study was blind in the
sense that no individual knew the name of the
wine being tasted.


Because there is a large decrease in the criterion
function from 1 to 2 dimensions, and then a
leveling off at 2, 3, and 4 dimensions, two
dimensions are retained.


Figure 4. The Judges' Weights

A plot of the configuration in this two dimensional
solution is given in Figure 3, in which the
scaled wine location is the leftmost letter in
the wine's name. A plot of the subject weights
for each of these two dimensions is given in
Figure 4. In Figure 4 note that subject 3 gives
almost no weight to dimension 1 and gives
comparatively little weight to the second
dimension. This outcome can be explained by the
fact that subject 3 had a bad head cold during
the judging. It is encouraging that the
multidimensional scaling seems to be picking up this
fact.

Interpretation of the stimulus configuration is
difficult. The fact that the Gallo Hearty
Burgundy and the Gallo Burgundy are closely
related is encouraging because of their close
proximity on the plot. Also encouraging is the
fact that the two wines made primarily from
zinfandel grapes are also close together on the
plot. Still, the meaning of each of the two
dimensions is difficult to interpret, especially
for one who prefers drinking, to learning about,
wine.


5. Discussion


Because the asymptotic theory in multidimensional
scaling is complicated by the fact that the
number of parameters increases in most models
with the number of subjects (see Ramsey, 1978),
the validity of all asymptotic results in
samples of even moderate size is questionable.
One should also question the validity of the
estimated variances and covariances and any
residual analysis. Still, some estimate of a
variance is better than none in most cases, and
a residual analysis in metric data should yield
some information.


The meaning of a residual analysis in nonmetric
data is not well understood, however. In such
data, because of the monotonic regression, the
residuals may not be meaningful. Since the
leverages also depend upon the residuals (and
in any event do not include information in
the disparity derivatives), they may not be
meaningful, either.

A residual analysis in Lp estimation also
needs to be investigated more fully. Indeed,
valid estimates of the parameter variances
are required when p is not 2. Estimates of
leverages also need investigation.

The fact that the estimates are not unique in
the nonmetric scaling models, even after allowing
for sign changes and rotations, is disconcerting.
This lack of uniqueness comes about because of
the disparity estimation. In classical nonmetric
scaling, ordinal data become pseudo continuous
via the monotonic regression. (After the
monotonic regression, the disparities are analyzed
as if they were continuous.) It seems that a
better method would start from the premise that
the data are ranks, and compute the configuration
estimates directly from the premise. In this
regard, the MAXSCAL algorithm of Takane and
Carroll (1981) shows promise. This algorithm
uses the information in the ranks in the same
way that the Cox proportional hazards model can
be thought of as using it, through the marginal
likelihood.

1. Introduction

In this paper we shall be concerned
with the effects of near collinearity in the
linear model

$$y = Xb + e , \qquad (1.1)$$

where X is an n × p matrix of rank p.
The qualification "near" is important, for
the case where X is exactly collinear, that
is, where rank(X) < p, is well understood,
at least mathematically. Here the
theory of estimation tells us that the
model (1.1) does not contain enough information
to estimate the vector of regression
coefficients. The cure is usually to
supply additional information in the form
of identifiability constraints on b, or more
rarely, when the collinearity results from
missing data, to supply additional observations
to the model. Design matrices are
the most important source of exactly collinear
models, and the associated theory
usually provides a clue to the appropriate
fix.

Near collinearities, on the other
hand, arise from various sources, and
their detection and treatment present a
number of research problems that have
not yet been satisfactorily resolved. In
this paper we shall be concerned with
their detection. In principle this problem
may be solved by deciding what deleterious
effects of collinearity one wishes to
avoid and then computing a measure of
these effects for the problem at hand. If
the effects are acceptably small, one
can continue with the analysis. If not,
one must take special action.


The chief ill effects of near collinearities
are that they inflate the variance of
the least squares estimate of b and that
they magnify the effects of errors in the
regression variables. In this paper we
shall be concerned with how collinearity
interacts with errors in the variables.
Because this is an Interface conference, I
will start from my own field and consider
errors arising from rounding during the
computation of the regression coefficients.
At the end of the paper, I will speculate
on the problem in general.

In the next section, we will introduce
the condition number of the regression
matrix X and indicate why it may be
considered a measure of collinearity. The
condition number shares with other measures
of collinearity the property that it
changes when the columns of the regression
matrix are scaled. The chief technical
problem in this paper is to find an
appropriate scaling. In §3 we shall
present an argument for scaling so that
the columns of the regression matrix have
the same norms (unit column scaling) and
show how the results of rounding error
analysis vitiate the argument. In §4 we
shall present a different argument, based
on rounding error analysis, that also supports
unit column scaling. The paper
concludes with some general observations.

2.  The  Condition  Number 

The condition number of a square
matrix was first introduced by Alan Turing
in 1948 to measure the sensitivity of
the solution of systems of linear equations
to perturbations in their coefficients. A
related condition number for the solution
of linear least squares problems was introduced
by Golub and Wilkinson in 1966.
For the regression matrix of (1.1) the condition
number is defined as

$$\kappa(X) = \|X\| \, \|X^{\dagger}\| , \qquad (2.1)$$

where

$$X^{\dagger} = (X^{T}X)^{-1}X^{T} \qquad (2.2)$$

is the pseudo-inverse of X. Here $\|\cdot\|$ is
the Euclidean norm of a vector or the
spectral norm of a matrix; i.e.

$$\|X\| = \max_{\|b\|=1} \|Xb\| .$$

For the properties of these norms as well
as proofs of the statements to follow in
this section, see (Golub and Van Loan,
1983).

The connection of the condition
number with collinearity can be made
clear by observing that the condition
number remains unchanged when X is
multiplied by a scalar. Consequently, we
may assume without loss of generality
that $\|X\| = 1$. With this scaling the
reciprocal of the condition number has the
following characterizations.

1. $\kappa^{-1}$ is the smallest singular value of X.

2. $\kappa^{-1} = \min_{\|b\|=1} \|Xb\|$.

3. $\kappa^{-1} = \min\{ \|E\| : \mathrm{rank}(X + E) < p \}$.

Let us discuss each of these characterizations
in turn.

Although the first characterization is
phrased as a numerical analyst might put
it, it can easily be recast in language that
a statistician would appreciate. The
singular values of X are the square roots
of the eigenvalues of $X^{T}X$. Hence if $\kappa^{-1}$
is small, $X^{T}X$ has a small eigenvalue, and
the inverse cross-product matrix $(X^{T}X)^{-1}$
has a large eigenvalue and hence is itself
large. Since the largest element of a positive
definite matrix occurs on its diagonal,
at least one diagonal of the inverse cross
product matrix is large. These diagonals
are called variance inflation factors, and
the connection between collinearity and
large variance inflation factors has often
been remarked in the statistical literature.

The second characterization says
that if $\kappa^{-1}$ is small, then there is a vector
b for which Xb is small. In other words,
X has an approximate null vector, a
sure sign of near collinearity.

The third characterization expresses
the relation between condition and collinearity
in a very natural manner.
Specifically, if $\kappa^{-1}$ is small then a small
perturbation of X is exactly collinear.

Although the condition number is
closely related to the notion of near collinearity,
it was originally introduced to
measure the sensitivity of least squares
coefficients to perturbations in the least
squares matrix; that is, the sensitivity of
regression coefficients to errors in the variables.
The principal result is the following.
In the model (1.1) let

$$\hat{b} = X^{\dagger}y \qquad (2.3)$$

be the estimated regression coefficients.
Let

$$\tilde{X} = X + E \qquad (2.4)$$

be a perturbation of X and let

$$\tilde{b} = \tilde{X}^{\dagger}y \qquad (2.5)$$

be the corresponding estimated regression
coefficients. Then

$$\frac{\|\tilde{b} - \hat{b}\|}{\|\hat{b}\|} \;\dot{\le}\; \left[ \kappa(X) + \kappa^{2}(X) \frac{\|r\|}{\|X\| \, \|\hat{b}\|} \right] \frac{\|E\|}{\|X\|} , \qquad (2.6)$$

where $r$ is the residual vector $y - X\hat{b}$. A
dot has been placed over the inequality
sign to indicate that terms in $\|E\|^{2}$ and
higher powers have been ignored.

The left hand side of (2.6) represents
a relative error in $\hat{b}$; as it approaches
unity, at least one of the components of $\tilde{b}$
must lose all accuracy. Likewise, the factor
$\|E\| / \|X\|$ represents a relative
error in X due to the perturbation E.
The factor in the brackets is always
greater than one and grows with $\kappa$. Thus,
if $\kappa$ is large, the regression coefficients can
be expected to be sensitive to errors in the
variables.

Although the condition number provides
a great deal of insight into the
nature of collinearity and especially into
its interaction with errors in the variables,
it is not much used by statisticians.
There are two reasons for this. The first
is that the right hand side of (2.6) is usually
an overestimate of the actual error.
This is not surprising, since the bound
was derived by numerical analysts, who
typically encounter the very small errors
caused by rounding on a computer and
can therefore afford to use a loose bound.
On the other hand, the errors in the variables
of a regression problem may be comparatively
large, and a loose bound may
cause the analyst to give up on a tractable
problem.

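The second reason, which is the one
we shall be concerned with in this paper,
is that the condition number is not invariant
under scaling of the columns of the
matrix. To see how this comes about let
us partition X in the form

$$X = (X_{*} \;\; x) \qquad (2.7)$$

and define

$$X_{\alpha} = (X_{*} \;\; \alpha x) ; \qquad (2.8)$$

that is, $X_{\alpha}$ is X with its last column
scaled by the factor $\alpha$.

Now as $\alpha$ approaches zero,

$$\lim_{\alpha \to 0} \|X_{\alpha}\| = \|X_{*}\| > 0 . \qquad (2.9)$$

On the other hand

$$X_{\alpha}^{\dagger} = \begin{pmatrix} X_{*}^{(\dagger)} \\ \alpha^{-1} x^{(\dagger)} \end{pmatrix} , \qquad (2.10)$$

where $X_{*}^{(\dagger)}$ consists of the first p-1 rows
of $X^{\dagger}$ and $x^{(\dagger)}$ is the last row. It follows
that

$$\lim_{\alpha \to 0} \|X_{\alpha}^{\dagger}\| = \infty \qquad (2.11)$$

and hence

$$\lim_{\alpha \to 0} \kappa(X_{\alpha}) = \infty . \qquad (2.12)$$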

From (2.12) we see that by scaling a
column of X in a suitable manner, we can
make the condition number as large as we
like. One feels instinctively that there is
something phony about this inflation of
the condition number, and on this
account the phenomenon has been dubbed
artificial ill conditioning. But calling
names does not solve problems, and there
remains the question of what scaling is
correct. We shall now turn to this problem.
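The growth of the condition number under column scaling is easy to observe numerically. The following Python sketch is an illustration under stated assumptions, not part of the original analysis: it treats only a matrix with two columns, for which the singular values follow from the 2 × 2 cross-product matrix in closed form, and the function names and the small test matrix are hypothetical.

```python
import math


def condition_number_two_columns(x):
    """Ratio of largest to smallest singular value of an n x 2 matrix of rank 2."""
    a = sum(r[0] * r[0] for r in x)      # entries of the cross-product matrix
    b = sum(r[0] * r[1] for r in x)
    d = sum(r[1] * r[1] for r in x)
    half_trace = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    lam_max = half_trace + radius        # largest eigenvalue of X'X
    lam_min = (a * d - b * b) / lam_max  # smallest eigenvalue, via the determinant
    return math.sqrt(lam_max / lam_min)


def scale_last_column(x, alpha):
    """Return X with its last column multiplied by alpha."""
    return [[r[0], alpha * r[1]] for r in x]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas = (1.0, 0.1, 0.01, 0.001)
kappas = [condition_number_two_columns(scale_last_column(X, a)) for a in alphas]
```

Each tenfold reduction of the scale factor increases the computed condition number by roughly a factor of ten, although the columns are no more nearly parallel than before.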

3.  A  Facile  Argument 

There is one scaling which is widely
recommended in regression analysis: scale
the columns of X so that they have norm
one. If column means have been subtracted
from X, this scaling makes the
cross-product matrix $X^{T}X$ a correlation
matrix; hence the name correlation scaling
is sometimes found in the literature.
However, we do not wish to confine our
analysis to models with a constant term,
and we will instead refer to the strategy
as unit column scaling.

Where rounding error is concerned,
there is an easy argument in favor of unit
column scaling. It is based on two observations.

1. Unit column scaling approximately
minimizes the condition number.

2. Unit column scaling ameliorates the
effects of rounding errors on computed
solutions.


The first observation is due to van der
Sluis (1969). The second is widespread
throughout the statistical literature (see
for example (Draper and Smith, 1981,
p. 264)). Together they place unit column
scaling in the enviable position of minimizing
error bounds such as (2.6) while at
the same time minimizing the effects of
rounding error. One could hardly ask for
more.

Unfortunately for this argument, the
second observation above is false. Provided
that no exponent exceptions occur
in the calculation of the regression
coefficients, the effects of rounding error
are essentially independent of the scaling.
The reader may verify this for himself by
a simple computation. Take a 3 × 2 least
squares problem and solve it in four
decimal digits on a hand calculator by
forming the normal equations and solving
them with Gaussian elimination. Note
the rounding errors at each stage. Now
multiply the second column by one hundred
and repeat the calculations. Up to
scaling factors that are powers of ten,
exactly the same rounding errors will
occur; the effect of the scaling is to scale
the rounding errors, not to change them.
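The hand-calculator experiment can be imitated in Python by using the standard `decimal` module with a precision of four significant digits, so that every arithmetic operation is rounded as it would be on a four-digit decimal machine. The sketch below is an illustration only; the data values and the function name are hypothetical, and the elimination order is one plausible choice.

```python
from decimal import Decimal, localcontext


def solve_normal_equations(x1, x2, y, digits=4):
    """Fit y ~ b1*x1 + b2*x2 by normal equations and Gaussian elimination,
    rounding every operation to `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        zero = Decimal(0)
        a11 = sum((u * u for u in x1), zero)
        a12 = sum((u * v for u, v in zip(x1, x2)), zero)
        a22 = sum((v * v for v in x2), zero)
        c1 = sum((u * w for u, w in zip(x1, y)), zero)
        c2 = sum((v * w for v, w in zip(x2, y)), zero)
        m = a12 / a11              # elimination multiplier
        a22 = a22 - m * a12        # reduced diagonal element
        c2 = c2 - m * c1           # reduced right hand side
        b2 = c2 / a22
        b1 = (c1 - a12 * b2) / a11
        return b1, b2


# A hypothetical 3 x 2 problem with four-digit data.
x1 = [Decimal(s) for s in ("1.234", "2.345", "3.456")]
x2 = [Decimal(s) for s in ("0.5678", "1.789", "2.891")]
y = [Decimal(s) for s in ("1.111", "2.222", "3.334")]

b1, b2 = solve_normal_equations(x1, x2, y)

# Multiply the second column by one hundred and repeat the calculation.
x2_scaled = [v * 100 for v in x2]
b1_scaled, b2_scaled = solve_normal_equations(x1, x2_scaled, y)
```

Because the scale factor is a power of ten, the digits produced at every stage are the same in both runs: the first coefficient is reproduced exactly, and the second is reproduced exactly apart from the factor of one hundred.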

More precisely, what is actually
shown by rounding error analysis is that
if a numerically stable method is used to
compute regression coefficients then the
computed coefficients come from a matrix
X + E, where the columns $e_j$ of E satisfy

$$\|e_j\| \le c \, 10^{-t} \, \|x_j\| . \qquad (3.1)$$

Here $x_j$ is the j-th column of X, t is the number of decimal digits
carried in the computation, and c is a
constant, depending on n, p, and the
details of the computer arithmetic. If we
write this bound in the form

$$\frac{\|e_j\|}{\|x_j\|} \le c \, 10^{-t} , \qquad (3.2)$$

then it says that the relative error in $x_j$
introduced by rounding error depends
only on the properties of the computer
arithmetic and not on the initial scaling of
the column.

Without the second observation
above, the case for equal column scaling
becomes less persuasive. It is true that
the scaling approximately minimizes the
condition number; but minimizing the
condition number does not necessarily
minimize a bound like (2.6), since a scaling
that makes $\kappa$ small may make
$\|E\| / \|X\|$ large. It is only when we
consider the error structure and the
bound simultaneously that we can hope to
make meaningful statements. We shall do
just that in the next section.

4.  Rounding  Error  and  Collinearity 

We begin this section by observing
that the argument of §3 has a rather loose
character. The first of the two observations
is precise enough; the second is
vague and false. But even if the second
were true and exactly worded, the connection
between the two statements has
not been made explicit. One feels that
there ought to be a relation between condition
number and rounding error and
therefore what is good for either must be
good for both. But in fact we have not
been precise in stating what we are about.

To circumvent this problem let us
focus on a specific question: How does
near collinearity enhance the effects of
rounding error on computed regression
coefficients? In fact the material to
answer this question is at hand. We have
a measure $\kappa(X)$ of near collinearity in X.
In (2.6) we have a relation between collinearity
and accuracy. Finally, in (3.1)
we have the structure of the error matrix
E, when only errors due to rounding are
considered. It will take only one more
observation to bring these together in
such a way as to suggest a natural scaling
for computing the condition number.

The observation is that when E is
due to rounding, $\|E\| / \|X\|$ will tend
to be independent of the scaling of the
columns of X. To see this first note that
for any matrix E with columns $e_j$,
$\|E\| \le \sqrt{p} \, \max\{\|e_j\|\}$. It follows from (3.1) that

$$\|E\| \le c \sqrt{p} \, 10^{-t} \max\{\|x_j\|\} . \qquad (4.1)$$

Since $\|X\| \ge \max\{\|x_j\|\}$, it follows
that

$$\frac{\|E\|}{\|X\|} \le c \sqrt{p} \, 10^{-t} , \qquad (4.2)$$

a bound which is independent of scaling.

The argument is now short. If
$\|E\| / \|X\|$ is independent of scaling,
then we are free to use any scaling in
(2.6). In particular, unit column scaling,
which tends to minimize $\kappa(X)$, will tend to
give the best bound. In other words, if
the condition number as a measure of collinearity
is to be used to predict the effects
of rounding error on regression
coefficients, it should be computed with
unit column scaling.
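For a matrix with two columns, the condition number under unit column scaling has a simple closed form that makes its independence of the original column scales evident. The Python sketch below is an illustration under that two-column assumption; the function name is hypothetical. After scaling, the cross-product matrix has unit diagonal and off-diagonal element c, the cosine of the angle between the columns, so its eigenvalues are 1 + |c| and 1 - |c|.

```python
import math


def kappa_unit_column_scaling(x):
    """Condition number of an n x 2 matrix after scaling both columns to norm one."""
    norm1 = math.sqrt(sum(r[0] * r[0] for r in x))
    norm2 = math.sqrt(sum(r[1] * r[1] for r in x))
    # Absolute cosine of the angle between the two columns.
    c = abs(sum(r[0] * r[1] for r in x)) / (norm1 * norm2)
    # Singular values of the scaled matrix are sqrt(1 + c) and sqrt(1 - c).
    return math.sqrt((1.0 + c) / (1.0 - c))


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_rescaled = [[r[0], 0.001 * r[1]] for r in X]   # last column shrunk by 1000

kappa_original = kappa_unit_column_scaling(X)
kappa_rescaled = kappa_unit_column_scaling(X_rescaled)
```

Both calls return the same value up to rounding, so the artificial ill conditioning introduced by shrinking a column disappears once unit column scaling is applied.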

The validity of this statement
depends on whether or not the bounds
(2.6) and (4.2) are realistic. We have
already observed that although (2.6) gives
away a lot, for the small errors encountered
in rounding error analysis it is probably
satisfactory. The fact that the scale
independence suggested by (4.2) obtains in
practice is supported by the details of the
rounding error analysis that generated
(3.1). Thus if the condition number, computed
with unit column scaling, predicts
that the regression coefficients are satisfactorily
accurate, the result can be taken
at face value.

5. Concluding Remarks

In the introduction of this paper we said that the
problem of detecting collinearities can be
resolved

    by deciding what deleterious effects
    of collinearity one wishes to avoid
    and then computing a measure of
    these effects for the problem at hand.
    If the effects are acceptably small,
    one can continue with the analysis.
    If not, one must take special action.

This technique of focusing on specific
problems is sound dogma, and the failure
to observe it in §3 led to confusion.
Only when we posed a precise question in
§4 were we able to obtain satisfactory
answers to the problems introduced by
the effect of scaling on condition numbers.

However,  one  pays  a  price  for  this 
success.  Namely,  one  can  pose  many 
problems,  and  the  answers  may  not  all  be 
compatible.  Let  us  look  at  three  ways  in 
which  our  basic  problem  can  change. 

First let us change the problem of §4
by positing that the model has a constant
term but we are not interested in the
effects of rounding error on the regression
coefficient corresponding to the
constant term. How then should the condition
number be computed to reflect the
accuracy of the remaining coefficients? A
careful analysis (which is beyond the
scope of this paper) will suggest that unit
column scaling should be applied to the
original regression matrix, the matrix
should then be centered, and finally the
condition number should be computed
from the centered matrix. Note that unit
column scaling is not applied to the centered
matrix before the condition number
is computed. This means that, contrary
to received opinion, correlation scaling is
not appropriate for predicting the effects
of rounding error on regression coefficients
in models with a constant term.

A second way in which our problem
can change is that we assess the effects of
collinearity in a different way. For example,
although the relative error

$$\frac{\|\tilde{b} - \hat{b}\|}{\|\hat{b}\|} \qquad (5.1)$$

tells us a great deal about the largest
components of $\hat{b}$, it tells us less about the
smaller ones. If these are of concern, then
a better measure will be the individual
relative errors

$$\frac{|\tilde{b}_j - \hat{b}_j|}{|\hat{b}_j|} . \qquad (5.2)$$

Here we end up with p separate problems,
each having its separate answer.

A third way in which our problem
can change is that we might forget about
rounding error completely and ask how
(2.6) can be used to predict the effects of
errors from other sources on the regression
coefficients. Again the problem of scaling
must be reexamined. In (Stewart, 1983) I
have given tentative reasons for believing
that a bound like (2.6) is most meaningful
when the columns of X are scaled so that
the columns of E are approximately equal
(equal error scaling as opposed to unit
column scaling). If this is true, then it
must be concluded that near collinearity
is not as basic a concept as might be
wished, since a matrix may be deemed
nearly collinear under one class of perturbations
and may be well behaved under
another.

The conclusion to be drawn from
this is that we should not attempt to
summarize something as complicated as
collinearity in a single number. Instead
we should look at all the techniques commonly
used in regression analysis and
analyze how collinearity affects them. If
simplifying patterns emerge, well and
good; but my belief is that several sets of
numbers will be required to capture the
effects of collinearity.

A variety of augmented scatter diagrams and stick-pin maps are described. These methods
use a nonparametric bivariate density estimator to determine the color, representation and
masking of data set elements. The identification of a single datum with a single "data point"
is considered. It is shown that for some applications it may be both computationally and
statistically useful to represent each datum with a spray consisting of many individual symbols.


1.  INTRODUCTION 

Current microcomputers provide economical color as
well as medium to high resolution graphical capability.
The scatter diagram and stick-pin map can now
be visualized as the pre-computer era ancestors of
many new ways of displaying statistical information.
This paper describes a series of experiments with a
variety of augmented scatter diagrams and stick-pin
maps which the new generation of computational
hardware has made economically practical.

There seems to have been very little previous consideration
of the intersection of the field of model-free
or nonparametric curve estimation and the field
of graphical methods in statistics. For example,
none of the papers listed as references to the survey
paper on graphical methods by Feinberg (1979)
mentions the terms curve estimation, p.d.f. or c.d.f.
estimation, or model-free methods. Even when a publication
that concerns graphics contains material
which is related to curve estimation, this relevance
seems coincidental. For example, Trumbo (1981)
carefully presents a theory for the coloring of bivariate
statistical maps. Principle II of Trumbo's (1981)
paper states that "Important differences in the levels
of a statistical variable should be represented by
colors clearly perceived as different."

In the present paper, color is discussed as a means
of conveying information about an estimated bivariate
density and not as a means of distinguishing
values assumed by one or more random variates.
Specifically a value $Z_i = \hat{f}(X_i, Y_i)$, where $\hat{f}$ is a bivariate
density estimator and $(X_i, Y_i)$ represents the i-th
of n members of a data set, is represented by color
and other means. Note that Feinberg's (1979)
classification of graphics lists histograms and scatterplots
in the same Category 4. The one dimensional
analog of the scatter plot is not a histogram
but rather a line marked at univariate sample values.
In this paper, $Z_i = \hat{f}(X_i, Y_i)$ can be envisaged as a generalization
of a value obtained from a histogram as
distinguished from a point of a scatterplot, e.g.,
$(X_i, Y_i)$. In Feinberg's (1979) paper, contours are
mentioned under the category of "graphs not involving
data." In this paper a way of visualizing contours
which are created from sprays of data is presented.
The reason the stick-pin map provides a particularly
good framework for discussion is that a stick-pin
can be viewed as having a head which conveys
graphic information to the viewer, attached to a
point which associates this information with a location
in a two dimensional space by a shaft of a length
which could be made proportional to estimated probability
density. It will be demonstrated that both
the choice of head characteristics and the choice of
pointer location can, in microcomputer applications,
be made to suit a variety of applications.

The topics considered here can be viewed as extensions
of two basic procedures. Kronmal and Tarter
(1974, pp. 377-301) present estimates of bivariate
densities, one of which is shown in Figure 1. In
essence, the production of this figure first involved
the bivariate estimation of a density, e.g., $\hat{f}(x, y)$, as
described in Tarter and Silvers (1975), and secondly
the graphing of $\hat{f}$. In the latter paper, several contour
diagrams are presented which depict density
estimators such as $\hat{f}$ shown in Figure 2. The routine
used to produce Figure 2 traces each contour and
only evaluates $\hat{f}$ at points near each contour. However,
in Figure 1, $\hat{f}$ is evaluated at every x, y coordinate
of a grid of points. This latter procedure uses
simpler computer code, which is an important consideration
for a subroutine designed to be moved
easily to a variety of microcomputer systems. On the
other hand, since the number of required evaluations
of $\hat{f}$ increases quadratically with increase in graphical
monitor resolution, the computer time demands
of this simple code may be substantially greater than
those of more complicated contouring routines, such
as that utilized to obtain Figure 2.

The second procedure was developed by Tarter (1979) and depended on the
observation that an estimated bivariate as well as univariate density
could serve as a useful data transformation. Consider that the color or
shape of each stick-pin head in a conventional stick-pin map can be
chosen on the basis of density height estimated at the location of a
data point. If one were only interested in the rare or unusual event,
one could choose to insert a pin only at those points over which the
estimated density is less than a constant. In an analogous way, Figure 3
was obtained after: 1) A bivariate density estimator $\hat{f}$ was
computed. 2) A sequence $\{Z_i\}$ was obtained by using the bivariate
density estimator $\hat{f}$ as a transformation. Specifically, the $i$-th
member $(X_i, Y_i)$ of the sample, where $i = 1, \ldots, n$, was
associated with a value $Z_i = \hat{f}(X_i, Y_i)$. Each value of $Z_i$
can be interpreted as an estimate of the sparseness or richness of
density within a fixed size neighborhood of the point $(X_i, Y_i)$.
3) An exchange of variables option was exercised to plot the point pairs
$(X_i, Z_i)$, $i = 1, \ldots, n$; alternately, $(Y_i, Z_i)$,
$i = 1, \ldots, n$, could have been plotted as a preliminary to the next
step. 4) An editing routine was used to select those points
$(X_i, Z_i)$, $i = 1, \ldots, n$, according to whether $Z_i$ exceeded a
constant. (This process removed all but the most sparsely distributed
points.) 5) The variable exchange option was again exercised to replace
the display of the edited $(X_i, Z_i)$ values with a display of the
corresponding $(X_i, Y_i)$ values, i.e., the sparsely distributed subset
of the original sample. A printer plot routine accompanies the program.
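
The editing idea can be illustrated with a minimal Python sketch. It is not the authors' Fortran program: the Gaussian product-kernel estimator, the bandwidth `h`, the threshold and the six data points are all assumptions chosen to keep the example short, and any accurate bivariate density estimator could take the place of `make_kernel_estimator`.

```python
import math

def make_kernel_estimator(points, h):
    """Return f_hat(x, y), a Gaussian product-kernel density estimate.

    This estimator is a stand-in chosen for brevity (an assumption).
    """
    n = len(points)
    scale = 1.0 / (n * 2.0 * math.pi * h * h)

    def f_hat(x, y):
        total = 0.0
        for px, py in points:
            total += math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * h * h))
        return scale * total

    return f_hat

def sparse_subset(points, f_hat, threshold):
    """Keep only the data points whose estimated density is below threshold."""
    return [(x, y) for (x, y) in points if f_hat(x, y) < threshold]

# Five tightly clustered points and one isolated point (made-up data).
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1), (10.0, 10.0)]
f_hat = make_kernel_estimator(data, h=1.0)
rare = sparse_subset(data, f_hat, threshold=0.05)
```

With these made-up values only the isolated point survives the edit, which mirrors the idea of inserting a pin only where the estimated density is low.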

The basic feature which differentiates the method used to obtain Figure
2 from that used to obtain Figure 1 is that the end product of Figure 2
is a display of data points, i.e., is associated with a sample, while
Figure 1 illustrates an estimator of an underlying population. In
essence the method used to generate Figure 2 goes full circle, i.e.,
starts with a sample and then uses a population density estimator to
modify a display of sample elements. Displays such as Figure 1 and the
contours of Figure 3 shed all reference to individual sample elements in
order to convey information regarding the global or overall nature of an
estimated density. As previously mentioned, the procedures to be
described in this paper, with varying degrees of success, display both
global distributional characteristics and the fine structure of the
sample. They also tend to resemble the routines used to produce Figures
1 and 2, and differ from contouring routines, insofar as they involve
simple and transportable computer code.

2.  GENERAL  METHODOLOGY 

The contours shown in Figure 3 were estimated from one thousand random
variates generated from the three component mixture of bivariate
densities,

$$(1/3)N(6, 9, 1.5, 1.5, 0.5) + (1/3)N(10, 10, 1.5, 1.5, -0.7) + (1/3)N(14, 11, 1.5, 1.5, 0.5)$$

(the order of the parameter arguments of $N$ is [...]), using methods
described in Tarter and Silvers (1975). The techniques to be described
in this paper do not depend upon the computational tractability of the
underlying density estimator. This is not the case with contouring
methods which rely on gradient procedures and therefore on the numerical
or analytical tractability of $\hat{f}$'s partial derivatives. Since any
accurate bivariate density estimator can be used in conjunction with the
methods to be described in this paper, for the sake of brevity, we will
omit the specific steps used to obtain $\hat{f}$ and leave these steps
to the tastes and needs of the reader.
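
A test sample of this kind can be generated with a short Python sketch. The parameter order is not legible in the text, so the sketch assumes each component is given as (mean of x, mean of y, standard deviation of x, standard deviation of y, correlation); if the third and fourth arguments were variances, the scale lines would need to change. The function name and the seed are hypothetical.

```python
import math
import random

# ASSUMED order: (mean_x, mean_y, sd_x, sd_y, correlation).
COMPONENTS = [
    (6.0, 9.0, 1.5, 1.5, 0.5),
    (10.0, 10.0, 1.5, 1.5, -0.7),
    (14.0, 11.0, 1.5, 1.5, 0.5),
]

def draw_mixture(n, components=COMPONENTS, rng=None):
    """Draw n points from an equally weighted bivariate normal mixture."""
    rng = rng or random.Random()
    sample = []
    for _ in range(n):
        mx, my, sx, sy, rho = rng.choice(components)  # each component has weight 1/3
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        x = mx + sx * u
        # Mixing u into y induces correlation rho between x and y.
        y = my + sy * (rho * u + math.sqrt(1.0 - rho * rho) * v)
        sample.append((x, y))
    return sample

sample = draw_mixture(1000, rng=random.Random(17))
mean_x = sum(x for x, _ in sample) / len(sample)
mean_y = sum(y for _, y in sample) / len(sample)
```

Under the assumed parameter order both sample means should lie close to 10, the average of the three component means.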

Figure 4 was obtained from the same size sample and underlying
distribution that was used to generate Figure 3. To obtain this
graphical display all five techniques to be described in this paper were
applied. These are: 1) Spraying, 2) Masking, 3) Banding, 4) Color and
5) Symbol differentiation. To implement all these techniques the
fundamental idea which led to the generation of Figure 2 was utilized.
Specifically, the estimated value $Z_i = \hat{f}(X_i, Y_i)$,
$i = 1, \ldots, n$, was used to: 1) Pick the color of a display
character, 2) Determine whether a given point should or should not be
displayed, 3) Select the number of display points to be associated with
each datum, 4) Mask display points to better visualize the edges of an
estimated terrace (this procedure is analogous to the trimming of the
borders of a lawn), and 5) Select the appropriate symbol for display
purposes.

Note that, unlike the display shown in Figure 1, $\hat{f}$ evaluations by
these new procedures are required either at, or in the neighborhood of,
$n$ data points and not at all grid points. We have experimented with
modifications of the methods used to obtain Figure 1 which were designed
to use a series of evaluations of $\hat{f}$ over a widely spaced grid to
determine the need for refinement. These routines not only required
cumbersome and lengthy code but also failed to resolve detail for a
variety of test patterns.
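
As a rough illustration of the difference in computing effort, the following Python sketch counts density evaluations for a full grid and for a data-based display. The screen resolutions, the function names and the five evaluations per datum are illustrative assumptions, not figures reported for the authors' program.

```python
def grid_evaluations(columns, rows):
    """Evaluations needed when f-hat is computed at every grid point."""
    return columns * rows

def data_evaluations(n, points_per_datum=5):
    """Upper bound when f-hat is computed only at or near the n data points."""
    return n * points_per_datum

coarse = grid_evaluations(320, 200)   # hypothetical low-resolution monitor
fine = grid_evaluations(640, 400)     # the same monitor at twice the resolution
near_data = data_evaluations(1000)    # one thousand data elements
```

Doubling the resolution quadruples the grid count, while the data-based count depends only on the sample size and the number of points sprayed per datum.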

The methods to be described here have a tendency to emphasize data
anomalies since, being elaborations of simple scatter diagrams and
stick-pin maps, fine structure is clearly resolved. On the other hand,
particularly when the spraying technique detailed in Section 3 is
utilized, global population characteristics can usually be as clearly
discerned by the new procedures as with contouring techniques (the
latter tend both to smooth over sample fine structure and to require
code highly dependent on the means used to obtain the bivariate
estimator $\hat{f}(x, y)$).

Naturally in some systems a superposed scatter diagram and contour
diagram may be a reasonable substitute for the new techniques described
in this paper. Note, however, that what appear to the eye as contours
generated by the new methods are actually formed directly from the data
points themselves. The previously mentioned Tarter and Silvers (1975)
paper and considerable earlier work by Gregor (1969) and others deal
with procedures for modifying an underlying density estimate to either
increase or decrease the contrast between distribution components. A
composite or overlay of the scatter diagram computed from one's original
data and a contour diagram in essence separates the head and point of
each stick-pin. On the other hand, since contours are actually formed
from the scatter diagram or stick-pin head by the new method, it is easy
to associate the effects of the contrast modification process with
individual points or subgroups of points.

Before turning to the specific means of creating augmented scatter
diagrams and stick-pin maps, it seems appropriate to summarize the
following basic algorithm:

1) A bivariate estimate $\hat{f}(x, y)$ is obtained from the sample
$\{X_i, Y_i\}$, $i = 1, \ldots, n$.

2) The sequence $Z_i = \hat{f}(X_i, Y_i)$, $i = 1, \ldots, n$, is
obtained.

3) The $\{Z_i\}$ sequence is ranked.

4) The ranked values of the $\{Z_i\}$ sequence are used to determine the
properties used to display each datum $(X_i, Y_i)$, $i = 1, \ldots, n$.

The word "datum" rather than "data point" is used above because, as we
shall see, a spray of points can be usefully associated with a single
datum.
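
The four steps can be sketched in Python as follows. This is only an outline under stated assumptions: `f_hat` is a stand-in for whatever estimator step 1 produces, the function names are hypothetical, and step 4 is simplified to equal-count rank groups, whereas the displays in this paper use density intervals.

```python
import math

def rank_by_density(points, f_hat):
    """Steps 2 and 3: transform each datum to Z_i and rank the Z_i."""
    z = [f_hat(x, y) for (x, y) in points]
    order = sorted(range(len(points)), key=lambda i: z[i])  # ascending density
    return z, order

def assign_symbols(points, f_hat, symbols):
    """Step 4: choose a display symbol for each datum from its rank.

    Equal-count rank groups are an assumption made for brevity.
    """
    _, order = rank_by_density(points, f_hat)
    n = len(points)
    chosen = [None] * n
    for rank, i in enumerate(order):
        chosen[i] = symbols[rank * len(symbols) // n]
    return chosen

def f_hat(x, y):
    """Stand-in for step 1: a standard bivariate normal density."""
    return math.exp(-(x * x + y * y) / 2.0) / (2.0 * math.pi)

points = [(0.0, 0.0), (3.0, 3.0), (1.0, 0.0), (2.0, 0.0)]
symbols_used = assign_symbols(points, f_hat, symbols=[".", "+"])
```

Here the two lowest-density points receive "." and the two highest-density points receive "+".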

The programs which generated the maps in this paper are written in
Fortran 77 under the UNIX 4.2 operating system and, with possibly minor
alterations in the i/o portions of the routines, can be compiled with
most alternative F77 compilers.

3. SPRAYING AND MASKING

We will now suggest that there may be considerable practical value in
using several disconnected symbols to represent a single datum and
even, in some cases, in representing some data with fewer "points" than
other data.

Generally speaking, the chief advantage of using a spray of points to
represent a single datum is that at the edges of what we will later
define as a contour, the spray can be masked in order to give the
tracking eye useful information about the shape of the contour. As an
analogy, consider that the user of a can of spraypaint delivers a cloud
or scatter of droplets for each pull of the spraygun's trigger. Towards
the center of a large area to be sprayed a single color, most of the
droplets will usually reach the object to be painted. However, along the
border separating two colors, masking tape or a masking tool is used to
block off a significant portion of droplets. Now consider the important
fact that when the painter knows that he or she is painting the interior
of an object, he or she need not be concerned about the use of a masking
tool. This basic principle, when applied to statistical graphics, makes
it computationally economical to program spraying techniques.
Specifically, suppose five display points are used to represent a single
datum, where four of the five points are corners of a square centered at
what for conventional procedures would be the fifth, i.e., the "data",
point. Now define a contour of bivariate estimator $\hat{f}$ as the
locus of $(x, y)$ points where $\hat{f}(x, y) = C$, where $C$ is some
positive constant smaller than the largest value assumed by $\hat{f}$
over the entire $x, y$ plane. In this discussion we will also assume
that the contour is a single closed curve.

Consider two distinct contours associated with the loci
$\hat{f}(x, y) = C_1$ and $\hat{f}(x, y) = C_2$ respectively, where
$C_1 \neq C_2$. The region between these contours will be a bracelet
shaped or banded subregion $R(C_1, C_2)$ of the $x, y$ plane. Suppose, as
in the previous section, the sequence $Z_i = \hat{f}(X_i, Y_i)$,
$i = 1, \ldots, n$, is obtained and ranked to form the sequence
$Z_{(i)}$, $i = 1, \ldots, n$. Under the previous assumption that each
contour is a single closed curve, the indices $\{(i)\}$ of $Z_{(i)}$
associated with those data points $(X_{(i)}, Y_{(i)})$ which fall within
$R(C_1, C_2)$ will form a consecutive sequence $S(C_1, C_2)$. Now
suppose a particular plotting character is to be placed at each of the
five coordinates $(X_i, Y_i)$, $(X_i + \epsilon, Y_i)$,
$(X_i, Y_i + \epsilon)$, $(X_i - \epsilon, Y_i)$,
$(X_i, Y_i - \epsilon)$, provided that each of these five points is
within $R(C_1, C_2)$ (where $\epsilon$ is a small positive constant).
Those values of these five coordinates which are most likely to protrude
beyond the region $R(C_1, C_2)$ will be associated with $Z_{(i)}$
indices near the beginning or end of the sequence of indices
$S(C_1, C_2)$. This fact allows one to construct a computer program
which significantly reduces the computer time required to trim the edges
of the five point datum spray.
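
A minimal Python sketch of the five point spray and its trimming against a band follows. The function names, the stand-in density `f_hat`, and the choice of which end of the band is treated as closed are assumptions made for illustration.

```python
import math

def five_point_spray(x, y, eps):
    """The datum itself plus four peripheral points offset by eps."""
    return [(x, y), (x + eps, y), (x, y + eps), (x - eps, y), (x, y - eps)]

def masked_spray(x, y, eps, f_hat, c_low, c_high):
    """Keep only spray points whose estimated density lies in (c_low, c_high]."""
    return [(px, py) for (px, py) in five_point_spray(x, y, eps)
            if c_low < f_hat(px, py) <= c_high]

def f_hat(x, y):
    """Stand-in density, proportional to an uncorrelated standard normal."""
    return math.exp(-(x * x + y * y) / 2.0)

# Band between the circular contours of radius 1 and radius 2 of the stand-in.
c_high = f_hat(1.0, 0.0)
c_low = f_hat(2.0, 0.0)

interior = masked_spray(1.5, 0.0, 0.1, f_hat, c_low, c_high)   # well inside the band
near_edge = masked_spray(1.95, 0.0, 0.1, f_hat, c_low, c_high)  # close to the outer contour
```

A datum well inside the band keeps all five points, while the datum near the outer contour loses the one peripheral point that protrudes beyond it.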

For example, to obtain Figure 4, the following sequence of steps was
utilized:

1) The one thousand random data elements from the distribution described
in the previous section were used to obtain a bivariate estimator
$\hat{f}$.

2) Each of the one thousand data elements, i.e., $(X_j, Y_j)$,
$j = 1, \ldots, 1000$, was transformed to $Z_j = \hat{f}(X_j, Y_j)$.

3) The set $\{Z_j\}$, $j = 1, \ldots, 1000$, was ranked to form
$\{Z_{(j)}\}$, $j = 1, \ldots, 1000$.
We will henceforth define $C_1$ as the largest value assumed by
$\hat{f}$ over the space within which $\hat{f}$ is to be displayed. The
method which was used to produce Figures 4 through 8 was based on the
specification of a set of regions in terms of $Z_{(j)}$, i.e., the
$k$th region will contain those points for which $Z_{(j)}$ is in the
half open interval $(u_k C_1, v_k C_1]$, where $u_k C_1$ is the left
endpoint and $v_k C_1$ is the right endpoint of the $k$th interval. (The
sequence of $u_k$ and $v_k$ values is chosen by the user.) The set of
intervals used in Figures 4 through 10 of this paper was
$(C_1, .95C_1]$, $(.95C_1, .90C_1]$, $(.90C_1, .80C_1]$,
$(.70C_1, .55C_1]$, $(.55C_1, .35C_1]$, $(.35C_1, .10C_1]$,
$(.10C_1, 0.0]$. These were chosen in order to assure that standard
uncorrelated normal data would generate bands one through six of
approximately equal width. In Section 5 of this paper, practical reasons
for using special procedures to choose the lowest band, here band seven,
will be discussed. Only the 1st, 3rd, 5th, and 7th intervals are
plotted, using red, green, blue, and black, respectively, for the color
figures. As elaborated upon in the next section, the remaining intervals
form the blank bands.
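
The assignment of a ranked density value to a band, and of a band to a color or to a blank, can be sketched in Python as below. The fractions are transcribed from the list of intervals as printed, which leaves the range between .70 and .80 uncovered; the sketch reproduces that list rather than guessing at a correction. The names `BANDS`, `PLOTTED`, `band_of` and `color_of`, and the choice of which endpoint is closed, are assumptions.

```python
# (upper, lower) fractions of C1 for bands 1..7, as printed in the text.
BANDS = [(1.00, 0.95), (0.95, 0.90), (0.90, 0.80), (0.70, 0.55),
         (0.55, 0.35), (0.35, 0.10), (0.10, 0.00)]

# Only these bands are drawn; the others are blank bands.
PLOTTED = {1: "red", 3: "green", 5: "blue", 7: "black"}

def band_of(z, c1, bands=BANDS):
    """Return the 1-based band whose interval contains z, or None.

    ASSUMPTION: each interval is open at its lower end and closed at its
    upper end, so that the maximum C1 itself falls in band 1.
    """
    for k, (upper, lower) in enumerate(bands, start=1):
        if lower * c1 < z <= upper * c1:
            return k
    return None

def color_of(z, c1):
    """Return the plotting color for z, or None if z is not displayed."""
    return PLOTTED.get(band_of(z, c1))

examples = [band_of(z, 1.0) for z in (0.97, 0.92, 0.50, 0.05, 0.75)]
```

A value of `None` from `color_of` means the datum falls in a blank band, or in the uncovered range, and is not drawn.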

5) The inner and outer edges of the regions are masked as follows:

i) Let $Z_k^{(\min)}$ and $Z_k^{(\max)}$ be the smallest and largest
values, respectively, of $Z_{(j)}$ in the $k$th region, i.e.,
$Z_k^{(\min)}$ is the smallest and $Z_k^{(\max)}$ is the largest element
of the set of $Z_{(j)}$ values belonging to the $k$th region. These
values represent the density estimate at the outer and inner bordering
contours of the $k$th region, respectively.

ii) The area beyond the outer bordering contour of the $k$th region is
masked by comparing the density at each of the four peripheral points of
the spray to $Z_k^{(\min)}$, i.e., only those points for which
$\hat{f}$ is greater than or equal to $Z_k^{(\min)}$ are plotted.
Beginning with $Z_{(j)}$, where $j$ is the smallest index belonging to
the $k$th region, each datum is sequentially sprayed and masked until
the masking process fails to reject any peripheral points $m$
consecutive times.

iii) An analogous process to that described in ii) is used to mask the
area beyond the inner bordering contour of the $k$th region, except that
the starting value is $Z_{(r)}$, where $r$ is the largest index
belonging to the $k$th region, and the index $r$ is sequentially
decremented. The process terminates when none of the points in $m$
consecutive five point sprays fail to be plotted.

One can conceive of examples where the plateau-like shape of $\hat{f}$
might cause the spraying process to terminate prematurely. Such problems
are easily remedied by increasing the value chosen for $m$.
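
The stopping rule for the outer edge can be sketched in Python as follows. The sketch reflects one reading of the procedure: once $m$ consecutive sprays pass untouched, the remaining data of the region are treated as interior with respect to the outer contour and are sprayed without further density evaluations. That reading, the function name, the stand-in density and the sample coordinates are assumptions.

```python
import math

def mask_outer_edge(ranked_points, f_hat, z_min, eps, m):
    """Spray the data of one region, ordered by increasing density.

    Peripheral points with density below z_min are rejected. Testing stops
    after m consecutive sprays in which nothing was rejected.
    """
    plotted = []
    tested = 0      # number of data whose sprays were actually tested
    clean_run = 0   # consecutive sprays with no rejected peripheral point
    for (x, y) in ranked_points:
        peripheral = [(x + eps, y), (x, y + eps), (x - eps, y), (x, y - eps)]
        if clean_run < m:
            tested += 1
            kept = [p for p in peripheral if f_hat(p[0], p[1]) >= z_min]
            clean_run = clean_run + 1 if len(kept) == len(peripheral) else 0
        else:
            kept = peripheral  # assumed interior: no evaluations needed
        plotted.append((x, y))
        plotted.extend(kept)
    return plotted, tested

def f_hat(x, y):
    """Stand-in density, proportional to an uncorrelated standard normal."""
    return math.exp(-(x * x + y * y) / 2.0)

ranked = [(2.0, 0.0), (1.95, 0.0), (1.8, 0.0), (1.7, 0.0), (1.6, 0.0), (1.5, 0.0)]
plotted, tested = mask_outer_edge(ranked, f_hat, z_min=f_hat(2.0, 0.0), eps=0.1, m=2)
```

In this made-up example only the first four data are tested; a plateau in the density could end the testing too early, which is why a larger $m$ is the suggested remedy.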

The $\epsilon$ used as part of the spraying process was specified as
either a percent of the sample range or as a percent of the sample
standard deviation. For all but a very few applications, $\epsilon$ will
differ in the $x$ and $y$ directions (horizontal and vertical,
respectively). Technically $\epsilon$ should be subscripted as either
$\epsilon_x$ or $\epsilon_y$. However, because in our program both
$\epsilon_x$ and $\epsilon_y$ are determined by a single user assigned
multiplier of the estimated range or standard deviation, the use of
subscripts was felt to be unnecessary notation. In the examples,
$\epsilon$ was chosen to be 2% of the sample range. In some applications
involving natural rather than simulated data, it may be preferable to
use the sample standard deviation or a scale parameter estimator which
is robust with respect to outliers.
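
A minimal Python sketch of this choice follows. The function name `spray_offsets`, its `scale` argument and the sample values are hypothetical; the default of 0.02 corresponds to the 2% of the sample range used in the examples.

```python
import statistics

def spray_offsets(xs, ys, fraction=0.02, scale="range"):
    """Return (eps_x, eps_y) as a fraction of each coordinate's spread.

    scale="range" uses the sample range; any other value uses the sample
    standard deviation.
    """
    if scale == "range":
        spread_x = max(xs) - min(xs)
        spread_y = max(ys) - min(ys)
    else:
        spread_x = statistics.stdev(xs)
        spread_y = statistics.stdev(ys)
    return fraction * spread_x, fraction * spread_y

eps_x, eps_y = spray_offsets([0.0, 5.0, 10.0], [0.0, 25.0, 50.0])
sd_x, sd_y = spray_offsets([0.0, 5.0, 10.0], [0.0, 25.0, 50.0], scale="sd")
```

A single multiplier thus yields different offsets in the two directions whenever the two coordinates have different spreads.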

It should be mentioned that there are many alternative means that can be
used to arrange the peripheral points, i.e., the spray. However, the
arrangement used in the examples is easy to program and gives useful
regions. Notice the gaps in the 3rd and 5th contours of Figure 4 (green
and blue contours, respectively). These gaps can be filled in by using a
broader spray. For the nine-point spray $(X_i, Y_i)$,
$(X_i \pm \epsilon_1, Y_i)$, $(X_i, Y_i \pm \epsilon_1)$,
$(X_i \pm \epsilon_2, Y_i)$, $(X_i, Y_i \pm \epsilon_2)$, with
$\epsilon_2 = a\epsilon_1$, multiplier values $a$ between two and four
were found to be effective. The rationale for the nine-point spray is
that the inner points, i.e., those offset by $\epsilon_1$, give
definition to the edge of the contour and the outer points, i.e., those
offset by $\epsilon_2$, make the contours appear continuous. Of course,
the broader the spray the greater the risk of masking features which
provide clues about the fine structure of the underlying density. Thus,
the use of very broad sprays tends to negate the advantages of this new
approach to contouring over the usual method of contouring illustrated
in Figure 3. (This is true both in terms of processing time and
resolution of detailed features of the data.)
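
The nine-point arrangement can be written as a short Python sketch. The function name and the default multiplier of 3 are assumptions; the text reports only that multipliers between two and four were found effective.

```python
def nine_point_spray(x, y, eps1, a=3.0):
    """Datum plus four inner points (offset eps1) and four outer points (a * eps1)."""
    eps2 = a * eps1
    points = [(x, y)]
    for e in (eps1, eps2):
        points += [(x + e, y), (x, y + e), (x - e, y), (x, y - e)]
    return points

spray = nine_point_spray(0.0, 0.0, 1.0, a=3.0)
```

Each of the nine points would then be subject to the same masking test as the points of the five point spray.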

We will now proceed to discuss and illustrate the fifth step in the
above process: the choice of display symbol.

4.  BLANK  BANDING 

An important and interesting aspect of these techniques is the decision
about whether to plot a given data point. It may seem paradoxical that a
datum can provide more information by not being represented than by
being represented. Consider, however, that possible data outliers will
generally be represented by points in only one of the bands, i.e., the
band associated with the smallest values of $\hat{f}$. Therefore, as
long as the masking procedure described in the previous section does not
blank out the lowest band, information about extraordinary data values,
i.e., outliers, will be adequately conveyed even if data in other bands
are not represented. As will be discussed in the last section of this
paper, it may be useful to mask some outliers.

There can, however, be distributional details within a body of data
which, although not what could be called global features, are of
interest and importance. Consider, for example, what might be called a
data "dimple", i.e., a small set $W_1$ of values surrounded by a region
$W_2$ such that the underlying density satisfies
$f(x, y) \ll f(u, v)$ where $(x, y) \in W_1$ and $(u, v) \in W_2$. Good
and Gaskins (1980) use the word "dip" for the feature we call a
"dimple". Because of the smoothing inherent in many nonparametric
estimates $\hat{f}$ of $f$, the presence of a dimple may not greatly
influence $\hat{f}$, particularly if the kind of dimpling often referred
to as digit preference pertains to one's data, where a dimple is, in
effect, in close proximity to a "raised area". However, since the
methodology described here only uses display symbols at or near the
coordinates of a datum, even if $\hat{f}$ is fairly constant at and
around a dimple, the presence of a dimple will be apparent in many
cases.


It has been our experience that data is either very dimpled, e.g., as
would be the case for most situations where digit preference pertains,
or has few if any dimples. Thus, if blank bands alternate with display
bands, and if within display bands each datum is represented by one or
more symbols, it is unlikely that data dimpling will go undetected,
since it is unlikely that all dimples will occur exclusively within
blank bands.

Figures 5 through 8 illustrate the advantages of blank banding and
masking. The first display of this sequence is a typical scatter diagram
obtained from a one-thousand element random sample from the same three
component mixed normal distribution used to obtain Figure 4. Figure 6
illustrates the use of blank banding where a single data point is used
to represent a single datum. The next display illustrates the effect of
spraying without masking. While the swarm of points shown in Figure 7 is
more vivid than that shown in Figure 6, we have found that this same
effect could be much more easily obtained by representing each point by
a larger symbol, e.g., a circle or X. Finally, Figure 8 clearly
illustrates the advantages of spraying and masking in terms of most
users' ability to discern global distributional structure.

5. SYMBOLS, COLORS AND SPRAYING

When we first began the research described in this paper, our impression
was that the most useful display characteristic which the $\{Z_i\}$
could determine would be found to be color. We now feel that banding and
spraying can often yield so much information that the use of a color as
opposed to monochrome display may be an unnecessary, albeit attractive,
luxury.

The choice of which particular sequence of colors or symbols to use to
represent a particular sequence of $\hat{f}$ values is closely connected
to the representation of perspective as outlined in L. Guerry's book
Cézanne et l'expression de l'espace, 1950, page 6, which contains a
brief history of artists' use of color and other techniques to represent
three dimensions on a two-dimensional surface.

In general, warm colors such as red seem to be ideally suited to
representation of points where $\hat{f}$ assumes large values, i.e.,
those which would be closest to the viewer if a three-dimensional model
rather than a two-dimensional symbolic representation were used.
Conversely, blue and finally black appear to be the best final colors to
use, where, as illustrated in Figure 4, black squares represent the
outlying points where $\hat{f}$ values are smallest.

Use of black squares, dots or circles at low density levels, and larger
sized "hats", "tildes", or "pluses" at higher levels, can simulate the
visual clue that, when viewed from above, if the height of a stick-pin
whose head were the chosen colored symbol were proportionate to
$\hat{f}$, then this stick-pin head would appear smaller.

It should be noted that many symbols have particular asymmetries which
can be used to advantage. For example, squares and "hats" tend to
roughen edges while "tildes" and "pluses" tend to "mask" well, i.e.,
lead the eye comfortably along a curved contour. Smoothness is usually
desirable for moderate to high $\hat{f}$, i.e., stick-pin height, levels
since it is for these values that global distributional features are
often visualized. Therefore it seems ideal to use squares and other
roughening symbols to represent outlying stick-pins, i.e., low values of
$\hat{f}$. Also, a symbol such as a "tilde" or "hat", which is longer
than it is high, can enhance resolution if its longer dimension
parallels that of the data display. For example, in Figure 4, since the
overall display of points is longer than it is wide, it seemed
advantageous to place tildes and hats as shown. Here again, artists
using most
engraving  methods  had  learned  to  place  the  trurin  of 
their  scratches  parallel  to  the  curves  they  wished  the 
eyes  of  their  viewers  to  scan.  (Antraesian,  Abrams 
1971,  p.376;  Oxford  1971,  p.  1188). 

As one final point, it should be mentioned that there may often be
reasons for using a blank band to represent points $(X, Y)$ where
$\hat{f}(X, Y) < \delta$ for some small positive value of $\delta$. We
have often worked with sets of data where the presence of one or a few
extreme outliers has led to an unsatisfactorily compressed view of the
bulk of data points. It seems advisable in these situations to select
the viewing area so that the bulk of the data are satisfactorily
displayed, and to indirectly indicate outlying data that cannot be
displayed. One method of doing this is to report the number of outliers
outside the viewport. A more informative method is to place a special
symbol just inside the border of the viewport at the intersection of the
viewport border line and the line connecting the centroid of the lowest
contour (or contours) and the outlier. If the graphical system available
to the user is sufficiently flexible, one can go a step further and let
the size of the symbol or the choice of the symbol itself suggest the
distance between the outlier and the symbol. For example, if the chosen
symbol is a letter, then the greater this distance from the outlier to
the viewport, the later the letter can be chosen from the alphabetic
sequence.
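
The placement of such a border symbol can be sketched in Python. The sketch assumes a rectangular viewport that contains the centroid, places the marker exactly on the border rather than just inside it, and codes distance with a hypothetical `unit` length per letter; the function names are likewise assumptions.

```python
import math

def border_marker(centroid, outlier, xmin, xmax, ymin, ymax):
    """Intersection of the viewport border with the line from centroid to outlier."""
    cx, cy = centroid
    dx, dy = outlier[0] - cx, outlier[1] - cy
    t = 1.0  # fraction of the way from centroid to outlier
    if dx > 0:
        t = min(t, (xmax - cx) / dx)
    elif dx < 0:
        t = min(t, (xmin - cx) / dx)
    if dy > 0:
        t = min(t, (ymax - cy) / dy)
    elif dy < 0:
        t = min(t, (ymin - cy) / dy)
    return (cx + t * dx, cy + t * dy)

def distance_letter(marker, outlier, unit):
    """Later letters stand for outliers farther beyond the viewport border."""
    d = math.hypot(outlier[0] - marker[0], outlier[1] - marker[1])
    return chr(ord("A") + min(int(d // unit), 25))

marker = border_marker((0.0, 0.0), (20.0, 10.0), -10.0, 10.0, -10.0, 10.0)
letter = distance_letter(marker, (20.0, 10.0), unit=5.0)
```

In practice the marker would be nudged slightly toward the centroid so that the symbol is drawn wholly inside the viewport.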


FIGURE 1. BIVARIATE PRINTER GRID PLOT SHOWING BANDED CONTOURS.

FIGURE 4. AUTOMATED STICK-PIN MAP - LEVELS DISTINGUISHED BY SYMBOL AND
BANDING - SPRAYING AND MASKING USED.

FIGURE 8. AUTOMATED STICK-PIN MAP - BOTH SPRAYING AND MASKING USED.
ABSTRACT


This  paper  discusses  experience  with  the  S  system  as  an  applications  programming  environment.  It  also 
considers,  in  the  context  of  data  analysis  and  graphics,  the  class  of  workstations  called  integrated  programming 
environments.  Current  research  on  a  merging  of  the  needs  of  computing  for  data  analysis  with  the  attractive 
features  of  integrated  programming  environments  is  outlined. 


1.  Introduction. 

This paper looks at interactive programming environments for
applications using data analysis, graphics and related kinds of
computing. The next two sections give a view of the history of S and of
recent ideas in the general field of "integrated programming
environments". In the context of the present conference we emphasize the
experience gained by a substantial number of applications development
groups from using S as a programming environment for their work. The
last section outlines, necessarily briefly, current research aimed at
combining the important features of integrated programming environments
with facilities needed for quantitative (scientific) computing; for
example, access to algorithms for numerical or graphical computations.
Experience gained from the use of S as an applications programming
environment for business research, data analysis, engineering projects
and other applications is being used to guide the new design,
particularly in terms of combining flexibility with run-time efficiency.

2.  The  S  System. 

S is a language and system for the interactive analysis of data,
developed at Bell Laboratories and currently in use on the UNIX
operating system. Two books [1; 2] describe respectively how to use the
system for data analysis and graphics, and how to extend the system by
incorporating new algorithms as S functions. The design of S and its
relation to other work in computer science and in statistical computing
are described in [3].

We designed S to enable and encourage good data analysis, by letting
users look quickly and conveniently at many displays, summaries, and
models for their data. In addition, we emphasized in our design the
ability to extend S. Users could write S macros to encapsulate analyses
that were to be repeated, possibly with differing arguments. They could
develop new functions that interfaced to arbitrary algorithms (typically
FORTRAN subroutines), not necessarily designed for use with S
originally. Also, and unusually for such systems, S allowed easy
creation of arbitrary new data structures to represent new analyses,
plots, etc.

These  facilities  have  made  S  into  an  applications 
programming  environment,  which  a  variety  of 
groups,  at  Bell  Labs,  at  AT&T  and  elsewhere 
(notably  at  universities),  have  used  to  create  other, 
often  more  specialized,  systems.  We  anticipated 
that  this  use  would  be  made  of  S,  and  provided  a 
number  of  features  accordingly.  (Besides  those 
mentioned  above,  there  are  facilities  for 
documenting  user  extensions,  for  writing  menu- 
driven  interfaces  in  S,  and  for  incorporating  S 
results  in  report-generation  software.)  In  a  typical 
scenario,  a  few  of  the  more  adventurous  computer 
users  in  a  local  group  find  out  about  S,  and  begin  to 
experiment with it for the needs of the group.
After  a  while,  these  users  decide  to  create  some 
more-or-less  canned  facilities,  built  on  S,  that 
would  then  be  a  system  to  be  used  by  other 
members  of  the  group.  In  the  two-tier  user 
community  resulting,  the  later  users  might  have 
little  direct  contact  with  either  S  or  the  operating 
system. 

The advantages of using S for such purposes are
several. S is designed to be easy to use and highly
interactive; it supports interactive graphics on a
variety of devices. By using the macro facility, new
analyses can be coded and tested easily. The
ability to write compiled functions, interface to
external algorithms, and define new data structures
means that the extensions possible are unlimited.
Feedback  to  us  from  about  20  applications  projects 
has  indicated  that  S  has  provided  a  substantial 
increase  in  the  productivity  of  the  developers 
compared  either  to  programming  in  a  language  like 
C  or  FORTRAN  or  to  the  use  of  other,  less 
flexible,  systems. 

This  extensive  experience  on  the  part  of 
applications  developers  has  also  contributed  several 
new  challenges  to  improve  the  system.  Here,  as 
often,  there  is  a  conflict  between  ease  of 
implementation  and  efficiency  of  computation. 
Writing  S  macros  is  easy,  particularly  up  to  the 
point  of  trying  to  make  the  macros  themselves 
"friendly"  to  the  end  users.  But  occasionally  the 
computations  involved  are  difficult  to  express  in  S. 
More  frequently,  serious  inefficiencies  can  result 
when  the  macros  are  applied  to  sizable  data  or  are 
themselves  used  in  an  iterative  fashion.  The  usual 
cure  attempted,  to  write  the  same  calculations  in  a 
compiled  function,  helps  in  most  cases  but  requires 
substantially  greater  programming  activity  on  the 
developer's  part. 

The  fundamental  problem,  to  a  large  extent,  is  that 
the  application  developer  is  working  not  with  one 
language  and  environment,  but  with  three  or  four. 
Further,  these  languages  inherit  a  degree  of  mutual 
inconsistency  from  the  software  tools  used  to  create 
them.  The  current  S  environment  depends  heavily 
both  on  existing  tools  and  on  tools  specially  adapted 
for  S.  The  macro  processor  is  a  version  of  the  m4 
macro  processor.  The  languages  in  which  new 
functions  and  new  algorithms  for  compilation  with 
S  are  written  are  extensions  of  FORTRAN 
utilizing  the  Ratfor  preprocessor  and  m4.  Heavy 
use  of  tools  was  an  important  factor  in  making  S 
work  in  the  first  place,  and  in  the  ease  with  which 
its  design  has  adapted  to  rapidly  evolving 
computing  environments  over  the  last  five  years  or 
so.  However,  the  price  paid  includes  inconsistencies 
among  the  various  levels  of  S  as  a  programming 
environment. 

The  challenge  for  our  current  research  is  then  to 
attack  simultaneously: 

•  simplifying  the  application  developer’s  view  of 
the  programming  environment; 

•  making  S  more  efficient  for  the  kind  of  use 
described  above. 

Before  outlining  the  implications  of  this  challenge, 
let us look at another aspect of recent computing
that points in an interestingly similar direction.

3. Integrated Programming Environments.

The recent evolution of high-powered and (relatively)
high-priced personal workstations has produced
examples, such as LISP machines and the
Smalltalk-80 system, of integrated programming
environments. Proponents of these systems assert,
with considerable informal evidence in support, that
the new environments allow users to be more
productive in designing, implementing and testing
new software. Specific features that distinguish
integrated programming environments from earlier
systems include:

•  the  user's  processes  operate  in  a  single, 
persistent  memory  space  (in  contrast,  for 
example,  to  communicating  via  files); 

•  the  environment  is  based  on  a  single  language 
and  corresponding  set  of  programming  facilities, 
for  user-written  and  system  facilities  alike; 

•  system  facilities  (the  "browser"  in  Smalltalk) 
allow  users  to  examine,  debug  and  change  all 
the  programs,  user  or  system,  in  a  highly 
interactive  way. 

The  intent  is  to  make  the  complete  system  easily 
visible,  testable  and  open  to  user  change,  via  a 
single  integrated  programming  environment. 

It is useful to compare this approach to the UNIX
environment, which represents a popular current
approach to interactive programming environments
(e.g., [4]):

•  processes,  in  most  cases,  operate  in  separate 
address  spaces  and  communicate  via  files  and 
file-like  connections; 

•  the  environment  emphasizes  the  use  of  multiple 
languages  (e.g.,  S,  the  shell  programs,  C, 
FORTRAN,  awk,  ...),  and  especially  the 
development  and  use  of  small,  independent 
software  tools. 

•  the  most  important  virtue  of  the  environment, 
for  many  uses,  is  that  it  does  not  get  in  the  way, 
but  provides  a  relatively  clean  and  simple 
computing  model  in  which  users/programmers 
can  do  what  they  want; 

•  on a mundane level, UNIX is portable to a wider
range of computers, including many that are an
order of magnitude less expensive than current
integrated workstations.


Parallel  to  the  programming  environment 
distinction  is  a  dichotomy  in  programming 
languages.  Languages  like  LISP,  Smalltalk 
(regarded  as  a  language)  and  Prolog  are  popular 
for  the  integrated  programming  environments. 
Conversely,  languages  like  C,  the  Algol  family,  the 
FORTRAN  family  and  Pascal  are  associated  with 
"conventional"  systems.  If  we  label  the  two  families 
of languages interactive and algorithmic, we can
list  characteristic  contrasts: 

•  interactive  languages  tend  to  be  used  to  build 
interactive  systems,  algorithmic  languages  to 
build  algorithms  or  specific  programs; 

•  algorithmic  languages  tend  to  use  scientific 
notation,  interactive  languages  some  syntax 
related  to  logic,  the  lambda  calculus  or  related 
forms; 

•  interactive  languages  tend  to  bind  during 
execution  (dynamically),  algorithmic  languages 
tend  to  some  form  of  compile-and-load; 

•  most  telling  of  all,  probably,  the  families  have 
different  definitions  of  virtue:  ease  of  use  and 
adaptability  for  interactive  languages  versus 
correctness  and  efficiency  for  algorithmic 
languages. 

Our  interest  is  not  to  make  a  judgement  of  merit 
between  the  two  approaches.  Rather,  we  want  to 
understand  what  features  of  each  are  most 
important  to  computing  for  data  analysis,  and  how 
to  obtain  them. 

Simply  put,  we  would  like  the  best  of  both.  As  the 
discussion  in  section  2  indicated,  advantages  of 
simple,  highly-interactive  program  design  and 
modification  are  very  relevant  to  analytical 
computing.  The  learning  barriers  imposed  by 
having  to  use  several,  partially  inconsistent 
languages  and  a  variety  of  (none  too  powerful) 
debugging  tools  seriously  inhibit  the  development  of 
applications  systems.  On  the  other  hand,  data 
analysis  and  graphics  depend  on  a  variety  of 
algorithms  and  software  tools  for  numerical 
calculations,  graphics,  and  report  generation.  We 
estimated about 50,000 lines of support code for S
[3]. A sizable fraction of that represents careful
algorithmic  design  and  implementation  (usually  not 
by  us).  Not  having  access  to  the  languages  like 
FORTRAN  and  C,  in  which  such  algorithms  are 
written,  would  be  crippling.  Even  if  we  had  the 
energy  to  re-implement  the  algorithms,  a  prudent 
user  would  hesitate  to  trust  the  result,  without  a 
long  sequence  of  testing. 


In  summary,  both  kinds  of  virtue  are  important  in 
data  analysis.  We  want  ease  of  use,  but  we  also 
need  access  to  a  variety  of  reliable  algorithms  and 
tools. 

4. An Integrated Programming Environment for Data
Analysis.

We  believe  that  a  consistent  and  achievable  mixture 
of  the  virtues  of  both  approaches  outlined  in  section 
3  will  provide  a  substantially  improved 
environment.  Research  is  proceeding  at  Bell 
Laboratories  on  such  an  environment,  using  the 
experience  with  S  as  a  starting  point.  This 
section  briefly  describes  the  new  environment,  a 
prototype  of  which  has  been  written  by  the  author. 
The  essential  characteristics  of  the  environment  are: 

•  a  single  analytical  language,  similar  to  the  user 
language  in  S  but  allowing  dynamic  definition 
and  modification  of  functions; 

•  a  browsing  and  debugging  environment  in  the 
same  language; 

•  explicit inter-language interfaces that allow the
use of existing or new algorithms written in
languages like C and FORTRAN;

•  similarly, an interface to the operating system
tools (e.g., via the pipe mechanism [4, page
190]).
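
As an illustration of the last item only, the following Python sketch passes data to an external program through a pipe and reads the result back. It is not the prototype's interface: the helper name run_through_tool is hypothetical, and a child Python process stands in for an operating system tool so that the example does not depend on any particular utility being installed.

```python
import subprocess
import sys

def run_through_tool(command, lines):
    """Send text lines to an external tool's stdin and return its stdout lines."""
    result = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()

# Stand-in for an operating-system tool (an assumption for this sketch):
# a child process that sorts its whitespace-separated input numerically.
SORT_TOOL = [
    sys.executable,
    "-c",
    "import sys; print('\\n'.join(sorted(sys.stdin.read().split(), key=float)))",
]

sorted_values = run_through_tool(SORT_TOOL, ["3.5", "1.25", "2.0"])
```

Any filter that reads standard input and writes standard output could be substituted for SORT_TOOL; the calling code would not change.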

This  environment  can  have,  for  the  applications 
developer,  much  of  the  flavor  of  an  integrated 
programming  environment.  Such  developers  will 
only  rarely  need  to  design  algorithms  or  tools; 
rather,  they  will  tend  to  use  such  software  when  it 
comes  along.  For  their  own  design,  the  analytic 
environment  will  be  much  more  effective. 

As with the Interlisp or Smalltalk environments,
programmers have access to the definition of the
language interactively. In our system, we make use
of the general hierarchical data structures to
maintain both the definition of operators and
functions in the language and also the tree of
partially evaluated expressions during evaluation,
all within the language itself. The fundamental
operations of parsing, code generation (that is,
optimisation of the parsed expression), and
evaluation are themselves accessible as functions in
the language. In particular, there exists a definition
of the semantics of evaluation written in the
language. The prototype has a rudimentary version
of a debugger, also written in the language.
Important building blocks are datasets for the
evaluation tree mentioned above and for the history
of the user's interaction. Various functions use
these datasets to examine and control evaluation:
for example, a menu-oriented browser examines the
evaluation tree (or any other hierarchical dataset),
with facilities for editing any piece of the dataset.
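
The idea that parsing and evaluation are ordinary functions, and that the parsed expression is an ordinary hierarchical object that can be examined and edited, can be illustrated by analogy with Python's standard ast module. This is only an analogy, not a description of the prototype; the names node_names and ReplaceConstant are invented for the sketch.

```python
import ast

source = "total / count + 1"

# Parsing is an ordinary function call that returns a tree of nodes.
tree = ast.parse(source, mode="eval")

def node_names(node):
    """List the node types in the tree, depth first, as a crude 'browser'."""
    names = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        names.extend(node_names(child))
    return names

names_before_edit = node_names(tree)

class ReplaceConstant(ast.NodeTransformer):
    """Edit one piece of the tree: turn the constant 1 into 10."""
    def visit_Constant(self, node):
        if node.value == 1:
            return ast.copy_location(ast.Constant(value=10), node)
        return node

edited = ast.fix_missing_locations(ReplaceConstant().visit(tree))

# Evaluation is also an ordinary function applied to the (edited) tree.
value = eval(compile(edited, "<tree>", "eval"), {"total": 6.0, "count": 3})
```

Here the tree is inspected, one node is replaced, and the modified tree is evaluated, all from within the same language that the expression is written in.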

The  operators  and  functions  in  the  language  have 
definitions  in  the  language.  For  efficiency,  some 
functions  are  built-in  (implemented  by  compiled  C 
code),  but  equivalent  definitions  in  the  language 
exist  (as  in  the  case  of  evaluation  itself),  to  permit 
verification  or  user  modification.  However,  it  is 
explicitly  expected  that  algorithms  for  numerical, 
graphical  and  other  calculations  will  be  supplied  as 
interfaces  to  C  or  FORTRAN  code.  Several 
approaches  to  implementing  this  interface  are 
possible.  The  current  prototype  uses  special 
functions  in  the  language  that  map  into  suitable 
calls  to  subprograms  in  C  or  FORTRAN  (or  other 
low-level  languages  if  needed).  A  table  of  currently 
used  subprograms  and  a  C-language  routine  that 
executes  the  actual  subprogram  call  are  generated 
by  a  function  in  the  high-level  language,  from  the 
parsed  code  for  interpreted  functions  that  interface 
to  C  or  FORTRAN.  New  interpreted  functions 
that  do  not  invoke  previously  unseen  subprograms 
do  not  require  any  special  consideration.  The  best 
approach  to  invocations  of  new  subprograms 
depends  on  the  availability  of  dynamic  loading  in 
the  local  version  of  the  loader. 
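
A table-driven interface of this general kind can be sketched in Python as follows. The sketch is hypothetical throughout: the table, the function names, and the use of routines from Python's math module as stand-ins for compiled C or FORTRAN subprograms are all assumptions made for illustration, and do not reproduce the prototype's generated table or calling routine.

```python
import math

# Hypothetical registry: names visible in the high-level language mapped to
# compiled routines. The math functions stand in for C or FORTRAN subprograms.
SUBPROGRAM_TABLE = {
    "c_sqrt": math.sqrt,
    "c_hypot": math.hypot,
}

def call_subprogram(name, *args):
    """Single entry point that looks up a routine by name and invokes it."""
    try:
        routine = SUBPROGRAM_TABLE[name]
    except KeyError:
        raise LookupError(f"subprogram {name!r} is not in the table") from None
    return routine(*args)

def register_subprogram(name, routine):
    """Adding a previously unseen routine is the only step that changes the table."""
    SUBPROGRAM_TABLE[name] = routine

def vector_length(x, y):
    """An 'interpreted' function: it needs no special handling because it
    only uses a routine that is already in the table."""
    return call_subprogram("c_hypot", x, y)

length = vector_length(3.0, 4.0)
```

New functions written purely in terms of existing table entries, like vector_length, leave the table untouched; only register_subprogram corresponds to the case of a previously unseen subprogram.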

Initial studies of the new system indicate
substantial improvements in run-time efficiency, by
comparison to similar computations in S, for many
typical calculations found in application systems
built on S. Future work will include studies of
trade-offs between the ability to redefine everything
dynamically and the desire to speed up a particular
calculation. For example, while a function could be
redefined within a loop and then reused in that
same loop, this seems generally unlikely. (One can
construct somewhat practical examples where it
would make sense, however; for example, when a
method is being modified based on previous
iterations of the same method.) Given the
assumption that function definitions remain
constant, the code-generation phase can perform
some optimizations of argument matching and
other computations. Before deciding what options
to pursue in these directions, we plan to study the
performance of typical application computations to
look for the important "hot spots".

In summary, experience so far has been
encouraging that a programming environment can
be designed to combine the ease of use and
modification found in integrated programming
environments with the access to algorithms needed
for quantitative work and with sufficient run-time
efficiency to support a variety of applications
development.


THE  MONTE  CARLO  PROCESSOR 

DESIGNING  AND  IMPLEMENTING  A  LANGUAGE  FOR  MONTE  CARLO  WORK 
In designing a new computer language for Monte Carlo Experimentation, one needs to
include high level data structures, a large family of functions to generate random
quantities and a wide range of control structures, but that is really the very least of it.
Monte Carlo experiments should be designed, just like any other experiment, and hence
a Monte Carlo Language should have a construct which can describe and perform a
complicated experiment. In fact, such a construct encourages researchers to design their
experiments more carefully. Monte Carlo work tends to be computationally intensive, and
hence a Monte Carlo Language cannot afford to be too inefficient. The Monte Carlo field is
continuing to advance, and hence a new language must be able to adapt to changes.

The Monte Carlo Processor is a computer package designed to do Monte Carlo
Experimentation. The heart of this package is a computer language called MCL, which is a
descendant of the languages C and S. It is designed expressly for Monte Carlo Integration
and Experimentation, and more care has been spent on such issues as accuracy and
compatibility with existing statistical software than is found in the existing discrete-event
simulation languages GPSS, Simula and Simscript. Unlike S, it is translated
into FORTRAN and then compiled, and hence is considerably more efficient. The MCL
language contains statements which describe experimental design, variance reduction
techniques and random variable generation that are not found in more conventional higher
level languages such as FORTRAN or Pascal.


1.  INTRODUCTION 

While there is a need to improve the computer
systems we use to analyze data, there is an even
greater need to improve the systems we use to do
Monte Carlo Experiments. The use of Monte Carlo
experiments in statistical research has increased
in recent years and is now a fixed part
of statistical research. A recent article surveying
the use of Monte Carlo methods in recent
years (Hauck and Anderson, 1984) claims that
about 20% of the articles published in 1981 in
several of the major journals (JASA, Applied
Statistics, Biometrics, Biometrika and Technometrics)
contained some form of Monte Carlo
technique used to justify their methods or results.
There are several reasons for the increased
use of Monte Carlo techniques in statistical
research. One is the increased accessibility
of computers in the past decade. Another
is the greater prevalence of computationally
intensive techniques such as the bootstrap.
Certainly the most important reason, though, is the
changing nature of statistics. Statisticians
are now trying to find the properties of statistics
in situations where the mathematical assumptions
make the problem of determining the power
of a test, or the variance or bias of an estimator,
very difficult if not completely intractable.
The field of robust methods has contributed
in this respect because researchers in
that field are often interested in studying the
properties of statistics when the nice,
mathematically tractable assumptions break down.

The foundation work that has been done to produce
statistical analysis packages such as SAS,
BMDP, Minitab, S, as well as other packages,
just hasn't been done for Monte Carlo work.

This in part has something to do with the nature
of Monte Carlo experimentation. The process of
doing a Monte Carlo experiment has more steps in
it than the process of analyzing data. One has
to decide on a question, or set of questions, to
answer with a Monte Carlo experiment, design
that experiment, write a program to perform that
experiment and, finally, analyze the results of
that experiment. There are simply more parts
to the process of doing Monte Carlo work than
there often are to the process of analyzing
data. There is more choice in how one puts the
parts of an experiment together in a computer
system.

The author wishes to thank Catherine Hurley for her help in preparing the talk and this paper,
Andreas Buja and Richard Kronmal for listening to the author as he sorted out this project and
Daijin Ko for his help, friendship and support.

Questions concerning this system may be addressed to the author at the University of Washington,

In their article, Hauck and Anderson (1984)
point out several problems in many Monte Carlo
studies. Researchers often use inferior algorithms
to generate random numbers and do other
tasks. Many times they are unaware of the properties
of the algorithms they use. Hauck and
Anderson point out an article published in 1981
containing a Monte Carlo study which made use
of the random number generator RANDU, an algorithm
whose inferior properties have been known
for 13 years. In addition, researchers don't always
carefully design their experiments to explore
the properties of the statistics in the situations
in which they are interested. Parameters
are chosen in haphazard ways which don't allow
the researchers to draw the kinds of conclusions
that they would like to draw. Finally,
researchers don't analyze the results of their
experiments with the kind of care that they
should bring to a data set. Often results are
published in tabulated form with little analysis,
graphical display or summary. My system is
designed to attempt to address these concerns.
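
One of RANDU's well-known defects can be checked in a few lines of Python. The sketch below is not drawn from Hauck and Anderson's article; it uses the commonly cited definition of RANDU (multiplier 65539, modulus 2^31) and verifies that every three consecutive values satisfy a fixed linear relation, which is why successive triples fall on a small number of planes.

```python
RANDU_MULTIPLIER = 65539      # 2**16 + 3
RANDU_MODULUS = 2 ** 31

def randu(seed, count):
    """Return `count` successive RANDU states starting from an odd seed."""
    values = []
    state = seed
    for _ in range(count):
        state = (RANDU_MULTIPLIER * state) % RANDU_MODULUS
        values.append(state)
    return values

def satisfies_three_term_relation(values):
    """Check x[k+2] == 6*x[k+1] - 9*x[k] (mod 2**31) for every triple."""
    return all(
        (values[k + 2] - 6 * values[k + 1] + 9 * values[k]) % RANDU_MODULUS == 0
        for k in range(len(values) - 2)
    )

sample = randu(seed=1, count=1000)
relation_holds = satisfies_three_term_relation(sample)
```

The relation follows from (2^16 + 3)^2 = 2^32 + 6(2^16) + 9, which is congruent to 6(2^16 + 3) - 9 modulo 2^31, so it holds for every seed.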

The core of the system works within the framework
of the traditional statistical Design of
Experiments setting. The experimenter is expected
to design an experiment to answer some
question about a statistic or family of statistics.
My system will take a description of
that experiment and perform it. It will also
take the data produced by that experiment and
load it into a statistical package, such as S
or Minitab, for analysis.

2.  BASIC  PROBLEM 

Before describing the system which I've been
working on, it might be a good idea to present
the type of problem which the system is designed
to solve. There are at least three kinds of
Monte Carlo simulations done in this world:
Monte Carlo Experimentation, to determine the
properties of some kind of statistical procedure;
Discrete Event Simulation, in which a Queuing
Network or Flow Chart Model is simulated on a
computer; and Monte Carlo Integration, in which a
complicated, multidimensional integral (such as
ones found in Nuclear and Particle Physics) is
estimated using Pseudo or Quasi Random Numbers.

My system is designed to tackle the first kind
of simulation, what I call Monte Carlo Experimentation,
although many of the Monte Carlo
problems done in Physics could be handled by it.
There are several systems to do Discrete Event
Simulation. While statistical Monte Carlo Experimentation
could be done and while I want my
system to have the capabilities to do it, it
presently lacks certain attractive features. In
order to do Discrete Event Simulation, special
data structures such as queues and calendars,
coroutines and some kind of clock mechanism are
often desirable. These capabilities to do Discrete
Event Simulation will be installed at a
later date.

It might be best to start with a simple example
to illustrate the kinds of problems in Monte
Carlo Experimentation. Suppose that we would
like to compare two estimators of location, the
sample median and the 10% trimmed mean. We are
interested in deciding which is the better of
these two estimators, and in particular we would
like to know which one is better to estimate the
location of a small sample of data which comes
from a symmetric long tailed distribution. In
this case we will decide that the estimator
which has the smallest variance is the best, and
we will use the Contaminated Normal family of
distributions to study these estimators. (For
our purposes here, a Contaminated Normal distribution
is a distribution in which an observation
comes from a Standard Normal (1 - γ)100% of the
time and γ100% of the time from a Normal
distribution with a variance σ².)

The general technique we will use will be to:

1)  Generate  a  set  of  random  data  on  the 
computer  having  a  Contaminated  Normal 
distribution. 

2)  Apply  the  median  and  the  10%  trimmed 
mean  to  the  random  data  set. 

3)  Replicate  the  process  of  generating 
data  and  applying  the  statistics, 
thereby  collecting  many  estimates  of 
the  median  and  10%  trimmed  mean. 

4)  Calculate  the  sample  variance  of  our 
sample  of  medians  and  of  our  sample  of 
10%  trimmed  means. 

5)  Study  the  results. 
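
Steps 1 through 4 can be sketched in Python for a single choice of parameters. This is an illustration written for this example, not output of the system described in this paper. The function names are hypothetical, the "10% trimmed mean" is assumed to drop 5% of the observations from each tail, and the replication count is kept small so the sketch runs quickly.

```python
import random
import statistics

def contaminated_normal_sample(rng, k, gamma, sigma):
    """Step 1: k draws, each N(0, 1) with probability 1 - gamma,
    otherwise N(0, sigma**2)."""
    return [rng.gauss(0.0, sigma if rng.random() < gamma else 1.0)
            for _ in range(k)]

def trimmed_mean(data, fraction_per_tail):
    """Mean after dropping `fraction_per_tail` of the points from each end."""
    ordered = sorted(data)
    cut = int(len(ordered) * fraction_per_tail)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

def estimator_variances(k, gamma, sigma, replications, seed=0):
    """Steps 2 to 4: apply both estimators to many data sets and return
    the sample variance of each collection of estimates."""
    rng = random.Random(seed)
    medians, trimmed = [], []
    for _ in range(replications):                 # step 3: replicate
        x = contaminated_normal_sample(rng, k, gamma, sigma)
        medians.append(statistics.median(x))      # step 2: apply statistics
        trimmed.append(trimmed_mean(x, 0.05))
    return statistics.variance(medians), statistics.variance(trimmed)

var_median, var_trimmed = estimator_variances(
    k=20, gamma=0.05, sigma=10.0, replications=500)
```

Step 5, studying the results, is left to whatever analysis tools are at hand; the two returned variances are the raw material for that comparison.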

The plan for our experiment is not quite complete.
The Contaminated Normal family of distributions
has two parameters, the fraction of
contamination, γ, and the contamination variance,
σ². There is the additional parameter
of the size of our data sets, which we will call
K. We must choose values for these parameters
and organize these values into a Designed Experiment.
We will choose six sample sizes: 10,
20, 30, 40, 50, 100. We are interested in small
sample sizes, which the values 10 to 50 represent,
but to make sure that we include all the
sizes of interest we choose one larger sample
size, 100. The design of our experiment will
involve looking at every possible combination of
the parameters, what is called a factorial design
in the Design of Experiments literature.
Finally, we have to choose the number of times
we will replicate the experiment for each design
point. A number of criteria are involved in
choosing that value, most notably the amount of
computing resources we have available, the
amount of time we have open to us and the variance
we'd like our final results to have (in
this case, the variance of our estimates of the
variance of the median and 10% trimmed mean).
Often, the number of replications will have to
be chosen from results of a small pilot study,
a short, small version of the Monte Carlo Experiment.
For our example, we will choose the
number of replications to be 5000 for each point
in our design. A brief summary of our experiment
is found below in figure 1.
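
The size of such a factorial design is easy to enumerate. The Python sketch below lists every design point for the factor levels of this example (the levels for the contamination fraction and the contamination standard deviation are those given in figure 1); the dictionary layout and function name are assumptions made for illustration.

```python
import itertools

# Factor levels taken from the example experiment.
FACTORS = {
    "k": (10, 20, 30, 40, 50, 100),      # sample size
    "gamma": (0.01, 0.05, 0.1),          # fraction of contamination
    "sigma": (2, 5, 10, 100),            # contamination standard deviation
}
REPLICATIONS = 5000

def factorial_design(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels))
            for levels in itertools.product(*(factors[n] for n in names))]

design_points = factorial_design(FACTORS)
total_samples = len(design_points) * REPLICATIONS
```

With six sample sizes, three contamination fractions and four contamination standard deviations, the design has 72 points, so 5000 replications per point means generating 360,000 data sets in all.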


Statistics:
    median
    10% trimmed mean

Design:
    Factorial

Distribution:
    Contaminated Normal Distribution

Parameters:
    K (Sample Size)                            10, 20, 30, 40, 50, 100
    γ (Percent Contamination)                  .01, .05, .1
    σ (Standard Deviation of Contamination)    2, 5, 10, 100

Replications:
    5000


Figure 1


3.  TYPICAL  PROGRAM

The basic form of the program needed to perform
our example experiment, or indeed any Monte
Carlo Experiment, is seen in Figure 2. It is, at
base, a pair of nested loops. The innermost
code is contained in a Replication Loop. That
process of generating data and evaluating statistics
is replicated a certain number of times
(in the case of our example, 5000) for each
point in the design. The results of interest to
the experimenter (in our case the variances of
the median and 10% trimmed mean) are then computed
and stored. The outer loop, the design
loop, performs the experiment for each point in
the design (in our case, a factorial design).


Basic Program

Figure 2


The system which I have been working on to do
Monte Carlo Experimentation produces code which
performs this program (or slightly more complicated
versions of it).

3.1  PROBLEM  TO  BE  SOLVED


The example described above points out many of
the characteristics of Monte Carlo Experimentation
which my system tries to take into account.
Perhaps the most important is that in Monte
Carlo Experimentation we are studying a Mathematical
Model for its own sake. We are not
studying an approximation to a physical system.
This has several implications for our system.

We have strong mathematical assumptions. Our
choice of distributions is done on a mathematical
basis, not necessarily because they approximate
a physical system, and hence the algorithms
which must produce those random values must be
provably correct. We have a careful experimental
design because we are interested in the
changes in the mathematical hypothesis. Finally,
we have one thing which simplifies things somewhat
for us: we are not doing Discrete Event
Simulation, and hence we don't have to worry
about the problems of programming a system which
needs to handle events occurring in time. We
might be programming a problem which involves
time series, or time in a relatively simple way
such as that, but we need not be concerned with
queues, servers or calendars or the problems
associated with them. There is no need for a
clock mechanism or coroutines.

4.  THE  MONTE  CARLO  PROCESSOR 

The real purpose of this paper is to describe a
new system which I have been building for the
purpose of doing Monte Carlo Experimentation.
That system takes as input a description of a
Monte Carlo Experiment, and produces as output a
program to perform that experiment. After performing
that experiment, the system then puts
the results of the experiment into a form
which a conventional statistical package can
read. The researcher can then analyze the data
produced by the experiment. The structure of
the system is diagrammed in a flowchart in
figure 3.


Monte Carlo Processor Structure

[Flowchart; recoverable box labels: User Interface,
Program Generator, FORTRAN Program, Converter Routine]

Figure 3

The core of the system is the Program Generator.
It is a compiler which takes as input a program
which describes a Monte Carlo Experiment. This
program is written in a new language designed
especially for this project. The syntax of this
language is very similar to the language S
(Becker and Chambers, 1984) and hence similar to
the language C (Kernighan and Ritchie, 1977).
There have been a few syntactic additions to enable
a researcher to easily describe a Monte
Carlo experiment. There are statements to describe
the design of the experiment (the DESIGN
statement), the parameters for the design, the
number of replications for each design point and
the quantities to be stored and accumulated for
later analysis (the RETURN statement). The language
is a functional language. New functions
can be added with relative ease by anyone, without
recompiling the whole system.

An example of the code is shown in figure 4 below.
It is a program to perform the experiment
involving the median and the 10% trimmed mean
which was described above.


Example Code

array(x,100)

design( factorial;
        k=(10,20,30,40,50,100),


x <- rnorm(k)*
     (rber(k,p)*(sig-1)+1)

return(var: median(x) )
return(var: mean(x,.05) )


Figure 4

The first statement is the declaration of storage
to hold the dataset. The structure x is a
one dimensional array with maximum length 100.
It can have any length between 1 and 100, and the
functions in the system will use only the amount
of data actually in the structure at any time.
This language supports scalars, single dimensional
arrays, multiple dimension arrays and
compound structures made up of scalars and
arrays.
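
The behaviour described for x, a fixed maximum size with a current length that functions respect, can be mimicked in Python as follows. The class is a hypothetical illustration of the idea and says nothing about how the generated FORTRAN actually stores arrays.

```python
class BoundedArray:
    """Fixed-capacity storage with a separate logical length (hypothetical sketch)."""

    def __init__(self, capacity):
        self._storage = [0.0] * capacity   # allocated once, like array(x,100)
        self._length = 0

    def assign(self, values):
        """Store new contents; the logical length becomes len(values)."""
        values = list(values)
        if not 1 <= len(values) <= len(self._storage):
            raise ValueError("length must be between 1 and the capacity")
        self._storage[:len(values)] = values
        self._length = len(values)

    def active(self):
        """Only the elements currently in use; stale storage is ignored."""
        return self._storage[:self._length]

x = BoundedArray(100)
x.assign(range(50))            # 50 values in use
x.assign([2.0, 4.0, 6.0])      # now only 3 values in use
mean_of_x = sum(x.active()) / len(x.active())
```

After the second assignment the storage still holds leftovers from the first, but any function that works through active() sees only the three current values.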

The second statement describes the design of the
experiment, defines the parameters and states
the number of replications. In our example,
we're doing a factorial experiment with parameters
k (sample size), sig (standard deviation
of contamination) and p (percent of contamination).
The experiment is replicated 5000 times
for each design point.


The third statement produces a sample from a
contaminated normal distribution. The function
rnorm produces a sample of length k from a
standard normal distribution. The remainder of
the statement calculates the standard deviation
for each observation in the dataset. The rber
function produces a set of k Bernoulli (0,1)
random variables which are 1 with probability p.

The third and fourth lines of code calculate the
median and the 10% trimmed mean of each dataset
and indicate that the system is to accumulate
the variances of those means and medians.
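
The experiment described by the example code can be mimicked outside the system. The following Python sketch is an illustration only and is not the FORTRAN that the Program Generator emits. The function names, the values of p and sig, the replication count used in the demonstration, and the reading of mean(x,.05) as trimming 5% of the observations from each tail are all assumptions.

```python
import random
import statistics


def contaminated_normal(k, p, sig, rng):
    """Sample of length k: N(0,1) draws, each scaled by sig with probability p."""
    sample = []
    for _ in range(k):
        z = rng.gauss(0.0, 1.0)            # plays the role of rnorm
        b = 1 if rng.random() < p else 0   # plays the role of rber
        # Standard deviation is sig when b == 1 and 1 otherwise.
        sample.append(z * (b * (sig - 1.0) + 1.0))
    return sample


def trimmed_mean(x, frac):
    """Mean after dropping int(frac * n) observations from each tail (assumed convention)."""
    xs = sorted(x)
    g = int(frac * len(xs))
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)


def run_design_point(k, p, sig, reps, seed=0):
    """Accumulate the Monte Carlo variances of the median and the trimmed mean."""
    rng = random.Random(seed)
    medians, tmeans = [], []
    for _ in range(reps):
        x = contaminated_normal(k, p, sig, rng)
        medians.append(statistics.median(x))
        tmeans.append(trimmed_mean(x, 0.05))
    return statistics.variance(medians), statistics.variance(tmeans)


if __name__ == "__main__":
    # Hypothetical design values; the source lists only the sample sizes k.
    for k in (10, 20, 30, 40, 50, 100):
        v_med, v_tm = run_design_point(k, p=0.1, sig=3.0, reps=500)
        print(k, round(v_med, 4), round(v_tm, 4))
```

Each call to run_design_point corresponds to one cell of the factorial design; a full experiment would loop over the chosen values of p and sig as well.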

4.1  THE  GENERATED  PROGRAM 

The  output  from  the  Program  Generator  (the  com¬ 
piler)  is  a  FORTRAN  program  which,  along  with  a 
library  of  FORTRAN  routines,  actually  performs 
the  experiment.  The  compiler  for  this  system 
does  not  compile  directly  into  object  code. 

The FORTRAN program and the FORTRAN library
are compiled with the local machine's FORTRAN
compiler. The prime reason for this scheme is
portability. FORTRAN is one of the better defined
and more portable languages. This system
generates very conservative FORTRAN code, keeping
close to the FORTRAN 77 standard and using
only well-defined fixed-format I/O. This will
keep the system from being tied down too closely
to a specific machine. One can also send the
FORTRAN output program to a high speed machine,
such as a Cyber 205 or a Cray, for execution if
the local computer proves too slow to do the
desired experiment. FORTRAN also tends to be a
fairly efficient language. Optimizing compilers
can produce code which is quite good. Hence
there is little to be gained by having this
system produce object code.

4.2  THE  FORTRAN  LIBRARY 

The  FORTRAN  Library  is  a  collection  of  routines 
which  perform  much  of  the  work  to  do  the  Monte 
Carlo  Experiments.  It  contains  routines  to  gen¬ 
erate  random  numbers  from  various  distributions, 
do matrix calculations, perform many of the
tasks that one needs to do for computing statistics
as well as calculate many of the conventional
statistics. They have been carefully
chosen  and  their  methods  and  properties  are  well 
documented.  This  library  is  easily  expandable. 
Using  a  simple  table  definition,  a  user  can  add 
a  function  to  the  library  that  will  be  recog¬ 
nized  in  the  language. 

4.3  THE  CONVERTER  ROUTINE 

The Converter Routine manages the output data
from the experiments. It can convert the data
into a form which can be loaded into any one of
the statistical packages like S, ISP, Minitab,
BMDP, SAS or SPSS. It can also extract subsets
of data or produce simple files of data which
can then be fed into any application program.

4.4  THE  USER  INTERFACE

The  User  Interface  controls  the  whole  system. 
Using  it,  an  experimenter  can  define  new  experi¬ 
ments,  edit  old  ones,  start  experiments  running, 
terminate  experiments  or  temporarily  suspend  ex¬ 
periments  to  lighten  the  load  on  the  computer 
and then restart them later. Within the User
Interface, the experimenter controls the Converter
Routine and can edit experimental output
or direct output to a statistical package.

5.  SUMMARY

The purpose of this system is really two-fold.
It attempts to unify the body of information
that  a  researcher  needs  to  do  statistical  Monte 
Carlo  Experimentation.  Often  an  experimenter 
cannot  do  a  good  experiment  without  searching 
the  literature  of  Statistics,  Computer  Science, 
Numerical  Analysis  and  Operations  Research.  Its 
second goal is to improve the way Monte Carlo
Experiments  are  performed  and  analyzed.  It  does 
this  by  casting  the  whole  process  of  programming 
an  experiment  into  the  classic  design  of  experi¬ 
ments  framework  and  by  giving  the  experimenter 
the  support  to  help  analyze  the  results.  The 
result  is  to  give  a  researcher  greater  freedom 
in  preparing  and  performing  Monte  Carlo  experi¬ 
ments.  Rather  than  worrying  about  finding  good 
random  number  generators  or  the  details  of  coding 
a  particular  experimental  design,  the  researcher 
is freed to work on questions more closely related
to the study in question. There is more
time to try different pilot studies to test
ideas before doing a big Monte Carlo Experiment.
It is easier to consider the use of variance
reduction techniques, which may speed the experi¬
ment  or  improve  its  accuracy.  Just  as  upper 
level  computer  languages  free  programmers  from 
being  concerned  with  many  of  the  details  of  pro¬ 
gramming,  this  Monte  Carlo  System  frees  experi¬ 
menters  from  the  details  of  programming  Monte 
Carlo  Experiments. 

Choosing  Smoothing  Parameters  for  Density  Estimators


David  W.  Scott 


Department  of  Mathematical  Sciences 

For data analysis in one, two, and three dimensions, nonparametric density estimation has proven to be a powerful
tool. A major practical problem in density estimation is the choice of smoothing parameters, to which the
estimates are quite sensitive. There are three different approaches for choosing a smoothing parameter, assuming
little a priori information: (i) interactive graphical evaluation of the smoothness of the density estimate or its
derivatives; (ii) minimisation of cross-validation criteria; and (iii) use of upper bounds as in oversmoothed
density estimates. In this paper I describe these approaches, review theoretical results, and examine small-sample
behavior.


1.  Introduction 

Automation of decisions required in statistical procedures
is highly desirable. The resulting expert systems can be widely
circulated and, contrary to popular belief, are likely to stimulate
growth in the profession. More importantly, these systems
encourage the user to understand the role of assumptions in
statistical models and how to cope with situations where those
assumptions fail. Statistical procedures currently recognized as
capable of dealing with a broad range of models are often
perceived as too difficult to use and subjective. The subjectivity is
often embodied in the choice of a few parameters whose values
reflect the expert's judgments about the data's peculiarities.

Multiple linear regression provides a typical example.
This is a favored statistical procedure because it is fully
automatic. But regression can only be viewed as automatic over
a very limited range of probability models. Usually the model
must be expanded to deal with outliers, influential points, and
transformation of variables while simultaneously attempting to
select an optimal subset of variables. Box and Cox (1964)
introduced an additional parameter for each variable in their
power-transformation family. Robust regression (Huber 1973) requires
specification of an influence function, which in turn contains
shape parameters. Handling influential points requires determining
acceptable levels of leverage (Belsley, Kuh and Welsch
1980). Some of these ideas are addressed in an experimental
expert system proposed by Gale and Pregibon (1983). Full
automation of robust regression is clearly a large and difficult task,
especially given current wisdom echoed by Carroll and Ruppert
(1985) "that robust estimators should not be used blindly." But
it is clear that robust regression is very important and even
partial automation desirable.

In this paper, we focus on automatic parameter selection
algorithms for nonparametric density estimators. Ideally, we
desire procedures that take data and produce a nearly optimally
smoothed density estimator for finite sample sizes. This problem
is easier than the regression problem because nonparametric
density estimators are robust (although some automatic selection
procedures may not be). Thus we may hope to have a limited
expert system for density estimation. In what follows we
survey past attempts, describe current results, point to new
results, and wonder whether in five years the consensus will be
that "nonparametric density estimators should not be used
blindly."

I will limit the discussion to histogram, series and kernel
estimators, paying most attention to the histogram and to kernel
estimators of the usual form

$$ \hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right) . \tag{1.1} $$

I shall consider two kernels: the Gaussian kernel $(2\pi)^{-1/2} e^{-x^2/2}$
and the triweight kernel $\frac{35}{32}(1 - x^2)^3$ on $[-1, 1]$. The smoothing
parameter in Equation (1.1) is the bandwidth h. For the histogram,
the smoothing parameter is the bin width, which will also
be denoted by h. Smoothing for series estimators may be controlled
by the number of terms in the series expansion or by a
bandwidth parameter similar to h or both.

The quality of the estimates will be measured by the
integrated mean squared error:

$$ \mathrm{IMSE} = E \int [\hat f(x) - f(x)]^2 \, dx . $$

Scott (1979) showed that use of a nonoptimal smoothing parameter,
say a factor c times the optimal parameter, results in an
IMSE increased by the factors

$$ (c^3 + 2)/(3c) \quad \text{and} \quad (c^5 + 4)/(5c) \tag{1.2} $$

for  the  histogram  and  kernel  methods,  respectively.  In  my 
experience,  reasonable  density  estimates  are  within  10%  of 
optimum.  Hence,  it  is  clear  that  only  a  fairly  narrow  range  of 
values  of  the  smoothing  parameter  is  acceptable  for  any  sample 
size,  even  n  =  IO*.  The  histogram  is  less  sensitive  than  the  ker¬ 
nel  method  to  choice  of  smoothing  parameter. 
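
The narrowness of the acceptable range can be checked numerically from the two inflation factors. The Python sketch below is an illustration under stated assumptions: the function names, the bisection helper, and the search brackets [0.1, 1] and [1, 3] are choices made here, and the 10% tolerance is the figure quoted in the text.

```python
def hist_factor(c):
    """IMSE inflation for a histogram whose bin width is c times optimal."""
    return (c ** 3 + 2.0) / (3.0 * c)


def kernel_factor(c):
    """IMSE inflation for a kernel estimate whose bandwidth is c times optimal."""
    return (c ** 5 + 4.0) / (5.0 * c)


def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi] by bisection; assumes f changes sign on the bracket."""
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def acceptable_range(factor, tolerance=0.10):
    """Values of c for which the IMSE stays within (1 + tolerance) of optimum."""
    g = lambda c: factor(c) - (1.0 + tolerance)
    return bisect(g, 0.1, 1.0), bisect(g, 1.0, 3.0)


if __name__ == "__main__":
    print("histogram:", acceptable_range(hist_factor))
    print("kernel:   ", acceptable_range(kernel_factor))
```

Both factors equal 1 at c = 1, and the interval obtained for the kernel factor is narrower than the one for the histogram factor, in line with the remark that the histogram is the less sensitive method.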


2.  Survey  of  Pre-1980  Algorithms

2.1.  Histogram  Methods

The first automatic rule for choice of a smoothing parameter
was given by Sturges (1926) for the histogram. His proposal
was simple and elegant. Consider a histogram with k bins
labeled 0, 1, ..., k-1. Then an "ideal" histogram would have
$\binom{k-1}{j}$ points in the j-th bin; adding, the corresponding sample
size is $n = \sum_{j=0}^{k-1} \binom{k-1}{j} = 2^{k-1}$. Hence the number of bins and
bin width are given by

$$ k = 1 + \log_2 n \tag{2.1a} $$

and

$$ h = (\text{sample range})/k , \tag{2.1b} $$

respectively. This rule is given implicitly or explicitly in virtually
every introductory textbook. Often the advice is given that
a histogram should have between 5 and 20 bins (from which, I
suppose, we infer that all samples contain between $2^4$ and $2^{19}$
points).

Scott (1979) analyzed the IMSE of the histogram and showed that

$$ \mathrm{IMSE} = \frac{1}{nh} + \frac{1}{12} h^2 R(f') + O(n^{-1}) , \tag{2.2} $$

where $R(\phi)$ denotes the squared $L_2$-norm of the function $\phi$,

$$ R(\phi) = \int \phi(x)^2 \, dx , $$

and is a measure of the "roughness" of $\phi$. The first term in
(2.2) is due to variance and the second to bias. From (2.2) it
follows that optimally (asymptotically)

$$ h^* = [6/R(f')]^{1/3} n^{-1/3} . \tag{2.3} $$

Comparing (2.3) and (2.1) we see that Sturges' rule asymptotically
has far too few bins and that the IMSE (2.2) is dominated
by errors due to bias.

It should be noted, however, that Sturges' rule is consistent,
although not of optimal order. Hence, consistency
results by themselves are not satisfactory.

Sturges based his arguments on the assumption that the
data are nearly Gaussian. Tukey (1977) has advocated a similar
role for the Gaussian density as a reference. Scott (1979)
adopted this point of view and advocated using

$$ h = 3.5 \sigma n^{-1/3} . \tag{2.4} $$

Chen and Rubin (1984) have shown this rule is consistent if
<  ∞. Within the class of densities satisfying this and
other technical constraints, rule (2.4) provides estimates of the
optimal order. However, for very rough densities the rule can
easily provide poor (usually oversmoothed) estimates. It is
interesting to note that the textbook advice of between 5 and 20
bins when applied to Gaussian data corresponds roughly to
50 < n < 1500, which is a more reasonable range of sample sizes.

Freedman and Diaconis (1981) proposed a more robust version
of (2.4) based on the interquartile range (IQR):

$$ h = 2 \, \mathrm{IQR} \, n^{-1/3} , $$

which, generally, is at least 30% smaller than (2.4).
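
The three histogram rules of this subsection are easy to compare on a sample. The Python sketch below is illustrative only: the function names are invented here, the bin count in Sturges' rule is rounded up to an integer (a choice not stated in the text), and the sample standard deviation and the default quartile method of the standard library stand in for σ and the IQR.

```python
import math
import random
import statistics


def sturges_width(x):
    """Bin width from (2.1a) and (2.1b), rounding the bin count up."""
    k = math.ceil(1 + math.log2(len(x)))
    return (max(x) - min(x)) / k


def scott_width(x):
    """Gaussian reference rule (2.4): h = 3.5 * s * n^(-1/3)."""
    return 3.5 * statistics.stdev(x) * len(x) ** (-1.0 / 3.0)


def freedman_diaconis_width(x):
    """Robust rule: h = 2 * IQR * n^(-1/3)."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    return 2.0 * (q3 - q1) * len(x) ** (-1.0 / 3.0)


if __name__ == "__main__":
    rng = random.Random(17)
    sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    print("Sturges:          ", round(sturges_width(sample), 3))
    print("Scott:            ", round(scott_width(sample), 3))
    print("Freedman-Diaconis:", round(freedman_diaconis_width(sample), 3))
```

For Gaussian samples of moderate size the Sturges width comes out widest and the Freedman-Diaconis width narrowest, which is the ordering the text describes.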


2.2.  Kernel  and  Series  Methods

Kernel estimators (1.1) were introduced by Rosenblatt
(1956) and Parzen (1962). Several authors have proposed a rule
that parallels (2.4) for Gaussian data with a Gaussian kernel:

$$ h = 1.06 \sigma n^{-1/5} . \tag{2.5} $$

This follows from the general result for nonnegative kernels:

$$ \mathrm{IMSE} = \frac{R(K)}{nh} + \frac{1}{4} \sigma_K^4 h^4 R(f'') + O(n^{-1}) \tag{2.6a} $$

and

$$ h^* = \left[ \frac{R(K)}{\sigma_K^4 R(f'')} \right]^{1/5} n^{-1/5} . \tag{2.6b} $$
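
The constant in (2.5) can be recovered from (2.6b). The Python sketch below is a check under stated assumptions: it uses the standard closed forms $R(K) = 1/(2\sqrt{\pi})$ and $\sigma_K = 1$ for the Gaussian kernel and $R(f'') = 3/(8\sqrt{\pi}\sigma^5)$ for a $N(\mu, \sigma^2)$ density, none of which are written out in the text, and the function names are invented for illustration.

```python
import math


def amise_bandwidth(R_K, sigma_K, R_f2, n):
    """Bandwidth from (2.6b): [R(K) / (sigma_K^4 R(f''))]^(1/5) * n^(-1/5)."""
    return (R_K / (sigma_K ** 4 * R_f2)) ** 0.2 * n ** (-0.2)


def gaussian_reference_constant(sigma=1.0):
    """Coefficient of n^(-1/5) for Gaussian data and a Gaussian kernel."""
    R_K = 1.0 / (2.0 * math.sqrt(math.pi))                 # roughness of the N(0,1) kernel
    sigma_K = 1.0                                          # kernel standard deviation
    R_f2 = 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)   # roughness of f'' for N(mu, sigma^2)
    return amise_bandwidth(R_K, sigma_K, R_f2, n=1)


if __name__ == "__main__":
    print(gaussian_reference_constant())   # (4/3)^(1/5), about 1.06
```

The ratio simplifies to $(4/3)^{1/5} \sigma$, which rounds to the 1.06σ of rule (2.5).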

Another informal procedure involves graphical inspection
of estimates for a decreasing sequence of smoothing parameters.
Generally, when the estimates begin to display high frequency
noise, a good choice is a slightly larger smoothing parameter;
see Tapia and Thompson (1978) for some examples.


2.2.1.  Series

The first modern results for choosing nearly optimal data-based
smoothing parameters came with the periodic series
estimator:

$$ \hat f(x) = \sum_{\nu=-m}^{m} w_\nu \hat f_\nu e^{2\pi i \nu x} , \tag{2.7} $$

where the $w_\nu$ are weights and the $\hat f_\nu$ are estimates of the Fourier
coefficients $f_\nu$. Kronmal and Tarter (1968) let $w_\nu = 1$ and provided
unbiased estimates of the change in the IMSE as m was
increased. They also provided inclusion rules for the $\hat f_\nu$ terms.
This anticipated the general unbiased estimates of the IMSE by
Rudemo and Bowman, which are discussed in section 3.1.
Unfortunately, as a smoothing parameter, m is a fairly crude
choice. Hence the elegance of this result was somewhat
obscured.

Wahba (1977, 1981) shifted the smoothing parameter away
from m, which she took as n/2, to a continuously varying
(smoothing) parameter $\lambda$ in the weights $w_\nu$:

$$ w_\nu = \frac{1}{1 + \lambda (2\pi\nu)^4} . $$

Through unbiased estimates of $f_\nu$ and $|f_\nu|^2$, Wahba provided
asymptotically unbiased estimates of IMSE($\lambda$). Wahba proposed
plotting IMSE($\lambda$) and choosing $\lambda^*$ as the minimizer. This
is essentially the thrust of modern kernel proposals, which differ
by providing exactly unbiased estimates of the IMSE shifted by
a constant. Wahba's algorithm has been illustrated in her
papers and more extensively analysed with Monte Carlo
methods by Scott and Factor (1981). But the basic framework
for automatic data-based density estimation was laid with series
methods.


2.2.2.  Kernel 

It is well known that series estimators may be re-expressed
as kernel estimates. For data in several dimensions, the kernel
form is easier to deal with. In addition, very efficient algorithms
for large samples such as the averaged shifted histogram (Scott
1985) approximate kernel estimators. Thus, cross-validation of
general kernel estimators is required.

The first attempt at cross-validation of kernel estimators
did not directly address IMSE, but used a modified maximum-likelihood
criterion (Habbema, Hermans and van den Broek,
1974; Duin, 1976). Hermans and Habbema were particularly
interested in constructing multivariate kernel estimates of medical
data. Specifically, the authors proposed a leave-one-out
optimisation problem:

$$ \max_h \sum_{i=1}^{n} \log \hat f_{-i}(x_i) , \tag{2.8} $$

where $\hat f_{-i}(x_i)$ is the kernel estimator with $x_i$ deleted and
evaluated at $x = x_i$. Scott and Factor (1981) found the small-sample
properties of (2.8) with Gaussian data were quite good,
but that (2.8) was sensitive to outliers, as later proved by Schuster
and Gregory (1981). Schuster has also proposed a related
criterion based on random splits of the data, which seems
promising empirically. The difficult proof of the consistency of
(2.8) was provided by Chow et al. (1983), but Hall (1982)
demonstrated the optimal order would not in general be realised.
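
Criterion (2.8) can be evaluated directly for a Gaussian kernel. The Python sketch below is a minimal illustration, not the original authors' code: the function names, the use of a fixed grid of candidate bandwidths, and the handling of a numerically zero leave-one-out density are assumptions made here.

```python
import math
import random


def loo_log_likelihood(x, h):
    """Sum over i of log f_{-i}(x_i) for a Gaussian-kernel estimate with bandwidth h."""
    n = len(x)
    norm = 1.0 / ((n - 1) * h * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i in range(n):
        s = 0.0
        for j in range(n):
            if j != i:
                d = (x[i] - x[j]) / h
                s += math.exp(-0.5 * d * d)
        if s <= 0.0:
            # An isolated point with a very small h drives the criterion to -infinity.
            return float("-inf")
        total += math.log(norm * s)
    return total


def likelihood_cv_bandwidth(x, grid):
    """Grid value of h that maximizes the leave-one-out log likelihood."""
    return max(grid, key=lambda h: loo_log_likelihood(x, h))


if __name__ == "__main__":
    rng = random.Random(5)
    data = [rng.gauss(0.0, 1.0) for _ in range(80)]
    grid = [0.05 * i for i in range(2, 41)]   # candidate bandwidths 0.10, ..., 2.00
    print(likelihood_cv_bandwidth(data, grid))
```

The branch that returns minus infinity hints at the outlier sensitivity noted in the text: a single remote observation forces the maximizing h upward so that its leave-one-out density stays away from zero.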

In 1976, Jim Thompson suggested and I implemented an
algorithm based on estimating IMSE. Notice in (2.6) that the
only unknown quantity is $R(f'')$. We estimated this quantity
by substituting the kernel estimate itself, which for a Gaussian
kernel is given explicitly by

$$ R(\hat f_h'') = \frac{3}{8\sqrt{\pi}\, n^2 h^5} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( 1 - \Delta_{ij}^2 + \Delta_{ij}^4/12 \right) e^{-\Delta_{ij}^2/4} , \tag{2.9} $$

where $\Delta_{ij} = (x_i - x_j)/h$. Then, following (2.6b), we formed the
sequence:

$$ h_{k+1} = \left[ \frac{R(K)}{n \sigma_K^4 R(\hat f_{h_k}'')} \right]^{1/5} , \tag{2.10} $$

where $h_k$ and $h_{k+1}$ are the current and next iterates of h, respectively.
We could have substituted (2.9) into the IMSE expression
(2.6a) and proceeded as Wahba but chose instead this
fixed point iteration. Not surprisingly, Scott and Factor (1981)
found the small sample performance of (2.10) and Wahba's algorithm
to be quite similar. Unfortunately, (2.9) does not provide
a consistent estimator of $R(f'')$ but is positively biased (for
small samples, this was unimportant). Removing this bias gives
an algorithm in the spirit of Wahba (see Scott and Terrell 1985;
also section 4.1). As an aside, $R(\hat f)$ and $R(\hat f')$ are consistent,
while $R(\hat f''') \to \infty$ when using h's given by (2.6b).
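
The fixed point iteration described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the 1976 implementation: the roughness estimate is the closed form obtained by integrating the squared second derivative of a Gaussian-kernel estimate, and the starting value, tolerance, iteration cap, and function names are choices made here.

```python
import math
import random
import statistics


def roughness_second_derivative(x, h):
    """Integral of the squared second derivative of a Gaussian-kernel estimate."""
    n = len(x)
    total = 0.0
    for xi in x:
        for xj in x:
            d2 = ((xi - xj) / h) ** 2
            total += (1.0 - d2 + d2 * d2 / 12.0) * math.exp(-d2 / 4.0)
    return 3.0 * total / (8.0 * math.sqrt(math.pi) * n * n * h ** 5)


def fixed_point_bandwidth(x, h0, tol=1e-6, max_iter=200):
    """Iterate h <- [R(K) / (n * R(f_h''))]^(1/5); returns (h, converged)."""
    n = len(x)
    R_K = 1.0 / (2.0 * math.sqrt(math.pi))   # Gaussian kernel, sigma_K = 1
    h = h0
    for _ in range(max_iter):
        h_next = (R_K / (n * roughness_second_derivative(x, h))) ** 0.2
        if abs(h_next - h) < tol * h:
            return h_next, True
        h = h_next
    return h, False


if __name__ == "__main__":
    rng = random.Random(11)
    data = [rng.gauss(0.0, 1.0) for _ in range(60)]
    start = 1.06 * statistics.stdev(data) * len(data) ** (-0.2)   # rule (2.5) as a start
    print(fixed_point_bandwidth(data, start))
```

The diagonal terms (i = j) of the double sum are always positive, which is one way to see the positive bias of the roughness estimate mentioned in the text.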

Silverman (1978) found a clever way to use the inconsistency
of $\hat f_h''$ in his test graph procedure. He showed that the
fluctuations in the second derivative should be of a certain fixed
size for optimal h. By examining a series of plots of $\hat f_h''$ for
decreasing values of h, the size of the fluctuations may be
guessed and an h chosen. This generalises the visual inspection
method described after Equation (2.6).


3.  Algorithms  since  1980 

3.1.  Unbiased  Cross-Validation

A new twist in cross-validation came with the introduction
of exactly (not asymptotically) unbiased estimates of the IMSE
by Rudemo (1980) and Bowman (1981). Consider decomposing
the $\mathrm{IMSE} = E \int [\hat f - f]^2$ into three terms:

$$ \mathrm{IMSE} = E \int \hat f(x)^2 dx - 2 E \int \hat f(x) f(x) dx + \int f(x)^2 dx . \tag{3.1} $$

The cross-validation criterion is

$$ \frac{1}{n} \sum_{i=1}^{n} \int \hat f_{-i}(x)^2 dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(x_i) . \tag{3.2} $$

The authors show that (3.2) provides an unbiased estimate of
the first two terms in (3.1) while the third term in (3.1) is constant.
Plotting (3.2) provides an unbiased (pointwise) estimate
of the true IMSE curve, shifted by the fixed (but unknown) constant
$R(f)$. Again the cross-validation estimate is that h which
minimizes the curve. Hall (1983) and Stone (1981) have shown
the resulting estimates are not only consistent but asymptotically
optimal. In practice, we should not expect very much
difference between (3.2) and Wahba's proposal, since the bias in
Wahba's  IMSE  estimator  is  quite  small,  of  order  n  '^. 

This proposal has several remarkable features. First, it is
applicable to any density estimator of the generalized kernel or
delta type. Thus when applied to histograms, a sequence of
smoothing parameters of order $n^{-1/3}$ results, while the sequence
is of order $n^{-1/5}$ for nonnegative kernels, and of order $n^{-1/9}$ for
appropriate negative kernels. Second, it avoids directly estimating
terms such as $R(f')$ in (2.2) and includes the $O(n^{-1})$ terms as
well. Third, it is easily extended to higher dimensions.


3.2.  Example 

For a histogram estimator, I examined the performance of
(3.2) with very large samples of normal data. For equally
spaced histograms with bin counts $\{n_k\}$, we must minimize
the criterion (3.3). In Figure 1, I have plotted (3.3) as a function
of h for a N(0,1) sample with n = 10,000, for which h* = .162.
Exactly where to place the bins
is a little problem, and I have chosen zero as a bin edge for all
the histograms. Notice the minimum of the curve is close to
$-R(f) = -1/(2\sqrt{\pi}) = -.2821$. But the amount of noise in the
curve is (initially) surprising. We are actually looking among
the obviously numerous local minima for the best h. Now it
can be shown that the variation observed in Figure 1 approximates
the standard deviation of the curve estimates about the
true IMSE estimate; this variation is much less than the variance
of the curve (3.3), which was shown by Rudemo (1980) to
be of order $O(n^{-1})$; see Scott and Terrell (1985). Thus while
the actual "best" local minimum is quite good in Figure 1, we
may expect a large percentage of h's to be outside the interval
(.72h*, 1.35h*), even with such large samples; see Equation (1.2).
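
The behavior described in this example can be explored with a short Python sketch. The criterion coded below is one commonly used closed form of unbiased cross-validation for an equally spaced histogram, $2/((n-1)h) - (n+1) \sum_k n_k^2 / (n^2 (n-1) h)$; treating it as the paper's Equation (3.3) is an assumption, as are the function names, the sample size, and the grid of bin widths. Zero is used as a bin edge, as in the text.

```python
import math
import random


def bin_counts(x, h, origin=0.0):
    """Counts for equally spaced bins of width h with a bin edge at origin."""
    counts = {}
    for xi in x:
        k = math.floor((xi - origin) / h)
        counts[k] = counts.get(k, 0) + 1
    return counts


def histogram_ucv(x, h, origin=0.0):
    """Unbiased cross-validation score for a histogram (assumed form of (3.3))."""
    n = len(x)
    sum_sq = sum(c * c for c in bin_counts(x, h, origin).values())
    return 2.0 / ((n - 1) * h) - (n + 1) * sum_sq / (n * n * (n - 1) * h)


if __name__ == "__main__":
    rng = random.Random(2)
    data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    grid = [0.01 * i for i in range(5, 101)]   # bin widths 0.05, ..., 1.00
    scores = [(histogram_ucv(data, h), h) for h in grid]
    best_score, best_h = min(scores)
    print(best_h, best_score)   # the score should sit near -R(f) = -0.2821
```

Printing the whole list of scores instead of the minimum shows the many local minima that the text attributes to bin-boundary effects.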

The corresponding curves in the kernel case do not exhibit
the variation for individual samples because continuous kernels
avoid problems due to the bin boundaries; however, the large
variation exists and we cannot expect to obtain an h with
desired accuracy for medium sample sizes with desired certainty.
Thus the asymptotic optimality theorems do not translate into
uniformly good small-sample properties; see also simulations by
Bowman (1984).


4.  Some  Recent  Work 

4.1.  Biased  Cross-Validation

If we think of the procedures in Section 3.1 as "unbiased"
cross-validation algorithms, then it is natural to think of
Wahba's method as "biased" cross-validation. We have looked
at some biased cross-validation algorithms in the spirit of the
Scott-Tapia-Thompson procedure for histogram and kernel
methods (Scott and Terrell, 1985). For histograms, we estimate
$R(f')$ as in (4.1) and substitute in (2.2) to obtain the criterion (4.2),
which may be compared to Equation (3.3). In Figure 2 (for the
same sample as used in Figure 1) we plot the estimated IMSE
(4.2). Notice the estimates are not only far less noisy, but also
provide a good estimate of the true integrated squared error.
The bias introduced is of lower order than the variance. Thus
the roles of biased and unbiased cross-validation for finite sample
sizes are not yet clear. Examples with certain mixture densities
are more favorable to the unbiased procedures for samples
n < 1000.
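
A biased criterion of this kind can be sketched as follows. The Python code assumes a particular estimate of $R(f')$ built from squared differences of adjacent bin counts, $\sum_k (n_{k+1} - n_k)^2 / (n^2 h^3) - 2/(n h^3)$, which on substitution into (2.2) gives $5/(6nh) + \sum_k (n_{k+1} - n_k)^2 / (12 n^2 h)$. Whether this matches the paper's Equations (4.1) and (4.2) term for term is an assumption, and the function names are invented for illustration.

```python
import math
import random


def bin_counts(x, h, origin=0.0):
    """Counts for equally spaced bins of width h with a bin edge at origin."""
    counts = {}
    for xi in x:
        k = math.floor((xi - origin) / h)
        counts[k] = counts.get(k, 0) + 1
    return counts


def histogram_bcv(x, h, origin=0.0):
    """Biased cross-validation score for a histogram (assumed form of (4.2))."""
    n = len(x)
    counts = bin_counts(x, h, origin)
    lo, hi = min(counts), max(counts)
    # Include the empty bins just outside the data so both end differences count.
    sq_diff = sum(
        (counts.get(k + 1, 0) - counts.get(k, 0)) ** 2 for k in range(lo - 1, hi + 1)
    )
    return 5.0 / (6.0 * n * h) + sq_diff / (12.0 * n * n * h)


if __name__ == "__main__":
    rng = random.Random(2)
    data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    grid = [0.01 * i for i in range(10, 61)]   # bin widths 0.10, ..., 0.60
    best_score, best_h = min((histogram_bcv(data, h), h) for h in grid)
    print(best_h, best_score)
```

The score is a sum of nonnegative terms and tends to zero as h grows, so the search is restricted to a bounded grid and the minimizer found is a local one.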

For a fixed sample, both (3.3) and (4.2) converge to zero as
$h \to \infty$. Now (3.3) is negative near h* but (4.2) is clearly nonnegative.
Hence (4.2) is actually minimized for $h = \infty$: we seek the
local minimizer near h*. We also expect $h = \infty$ to be a local
minimum for (3.3). For small samples, the region in the neighborhood
of the local minimum where (4.2) is convex may be
very small or nonexistent when using the biased methods. This
region is much larger for unbiased procedures. Recall the
Scott-Factor simulation results where method (2.10) occasionally
failed to have a solution. For these cases, the (oversmoothed)
upper bounds given below are very useful.


4.2.  Upper  Bounds 

Rules (2.4) and (2.5), which are based on Gaussian models,
turn out to be close to upper bounds on smoothing parameters;
see Terrell and Scott (1985). Under various constraints on scale
measure, densities minimizing the roughness functionals $R(f')$
and $R(f'')$ may be found. When substituted
into expressions such as (2.3) and (2.6b), useful upper
bounds may be obtained. For example, a histogram of a density
with finite support (a,b) satisfies

$$ h \le (b - a)/(2n)^{1/3} . \tag{4.3} $$

Useful expressions exist for densities of infinite support, as well
as for kernel estimators. Rules based on Gaussian models are
only slightly narrower.
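
Bound (4.3) lends itself to a simple automatic check on any data-based bin width. The Python sketch below is illustrative: the function names are invented here, and the threshold used to decide that a candidate is "much smaller" than the bound is a hypothetical value, not one given in the text.

```python
def oversmoothed_bin_width(a, b, n):
    """Upper bound (4.3) on the bin width for a density supported on (a, b)."""
    return (b - a) / (2.0 * n) ** (1.0 / 3.0)


def needs_inspection(h_candidate, h_upper, small_ratio=0.25):
    """Flag a candidate above the bound or far below it.

    small_ratio is a hypothetical cut-off for "much smaller"; tune it to taste.
    """
    return h_candidate > h_upper or h_candidate < small_ratio * h_upper


if __name__ == "__main__":
    bound = oversmoothed_bin_width(0.0, 1.0, 500)
    print(bound)                            # 0.1, since (2 * 500)^(1/3) = 10
    print(needs_inspection(0.08, bound))    # within the plausible range
    print(needs_inspection(0.30, bound))    # exceeds the upper bound
```

Equivalently, the bound says a histogram on a finite interval should have at least $(2n)^{1/3}$ bins.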


Figure 1. Example of the unbiased cross-validation function
for a histogram with Normal data and n = 10,000.


For very small samples, these rules are probably as good
as any. For very large samples, any inefficiency may not be
important: we may be willing to accept an oversmoothed
IMSE  of  10  *  even  though  the  optimal  IMSE  could  be  10  *. 
This is because the oversmoothed density estimates will contain
the important features of the true density, though somewhat
flattened.


5.  Discussion

Rice (1984) has investigated cross-validation results for the
related problem of nonparametric kernel regression. But that is
an easier problem to diagnose graphically, since the curve may
be compared to the locations of the points. For kernel density
estimates, some authors suggest comparison with a histogram,
but which histogram? It is possible to compare the integrated
kernel estimate with the sample cdf, but the optimal smoothing
parameters for the cdf and density are different. So cross-validation
for the density is apparently not as easy a problem.

The  univariate  methods  may  be  extended  for  choosing 
smoothing  parameters  for  multivariate  estimators  (one  for  each 
variable).  In  my  experience  where  I  choose  smoothing  parame¬ 
ters  by  eye,  I  find  the  multivariate  case  is  somewhat  easier  than 
the  univariate  case  because  of  interaction  effects,  which  help 
gauge  changes  in  the  density  function  for  each  parameter. 
Perhaps  cross-validation  in  this  case  will  be  no  harder. 


For large enough n, empirical evidence suggests the biased
cross-validation algorithm works almost without fail. In other
words, the smoothing parameter obtained is acceptably close to
the optimal value for almost every sample. This should be
conditioned by the obvious statement that the true density may
contain very minute features not observable without more data.
But such possibilities should not paralyze our willingness to use
nonparametric methods. For smaller samples, the oversmoothed
results  are  extremely  useful,  because  if  the  proposed  cross- 
validation  value  is  greater  than  or  much  smaller  than  the  upper 
bound,  a  clear  signal  for  closer  inspection  has  been  given. 

Can  an  expert  system  be  built  for  density  estimation? 
Yes,  but  as  in  the  parametric  regression  case,  it  probably  won't 
be  blind. 

ON  A  CLASS  OF  MULTIVARIATE  DENSITY  AND  REGRESSION  ESTIMATORS


V.  K.  Klonias 


Mathematical Sciences Department
The Johns Hopkins University
Baltimore,  Maryland 

We present a class of nonparametric multivariate maximum penalized likelihood
estimators (MPLE) of probability density functions. The estimates are multivariate
splines with knots at the sample points. The numerical effort for the evaluation of
the estimates is essentially independent of the dimension of the data. Under mild
assumptions the MPLE's are seen to be consistent in a variety of metrics and with
optimal rates of convergence. These density estimators lead naturally to a class of
multivariate regression estimators. Some numerical examples are presented where the
smoothing parameters are estimated from the data by an approach suited to these
splines.


1.  INTRODUCTION 

Let $X_1, \ldots, X_n$ in $R^p$ be i.i.d.
observations from a distribution function F
with density f and let $F_n$ denote the associated
empirical distribution function. The nonparametric
maximum penalized likelihood method of
density estimation (MPLE), introduced by Good
and Gaskins (1971 and 1980), suggests
estimating f by the maximizer of the log-likelihood
minus P(v), a roughness penalty
functional which is usually applied on the
square root of the density, $v = f^{1/2}$. For example,
$P(v) = \alpha \int (v')^2 + \beta \int (v'')^2$, where $\alpha, \beta \ge 0$ and
at least one is strictly positive. In de Montricher,
Tapia and Thompson (1975), the
existence and uniqueness of the MPLE's were
rigorously established within the framework of
the Sobolev spaces $H^m(R) = \{ u \in L_2(R)$ such that
$\| u^{(m)} \|_2 < +\infty \}$, $m \in N$, where $L_2(R)$ denotes
the space of square integrable functions and
$\| u \|_2^2 = \int u^2$. For discretized MPLE's see
Tapia and Thompson (1978) and Scott, Tapia and
Thompson (1980). For penalties on log f see
Silverman (1982). We follow here the setting
in Klonias (1984), and discuss the construction
of the multivariate MPLE's, their consistency,
numerical evaluation and data-based choice of
the smoothing parameters.

2.  THE  ESTIMATORS 

For the estimation of the probability density
in a nonparametric setting, the likelihood can
be considered as a functional with argument
ranging over a suitable space of density
functions f. If no smoothness conditions are
imposed on f, the likelihood is unbounded and
the unconstrained maximum likelihood "solution"
can be represented as an average of Dirac
deltas centered at the observations, i.e., it
is the distributional derivative $\delta_n$ of $F_n$. In
fact, the classical kernel estimates of the
density can be viewed as approximations to $\delta_n$.

The MPLE's u of $v = f^{1/2}$ considered here are
solutions to the following optimization problem

(2.1)  $\max \{ \sum_{i=1}^{n} \log u(X_i)^2 - \lambda \int |\hat u|^2 dM ; \; u \in H \}$

subject to $u(X_i) > 0$, $i = 1, \ldots, n$,
where $\hat u$ denotes the Fourier transform of u,
$H = \{ u \in L_2(R^p) : \int |\hat u|^2 dM < +\infty \}$, M
is a positive measure on $R^p$ dominated by the
Lebesgue measure, and $\lambda > 0$ is such that
$\int_{R^p} u^2 = 1$. The optimization problem (2.1) has
a unique solution given implicitly by

$u(x) = \lambda^{-1} \sum_{i=1}^{n} u(X_i)^{-1} K(x - X_i)$,  $x \in R^p$,

where the function K is determined by $\hat K m = 1$,
$m = M'$, i.e., the MPLE u is a spline function
with knots at the sample points. The smoothing
parameters enter through the penalty functional
by letting M depend on $h_1, \ldots, h_p > 0$, i.e., we
consider $m(h_1 t_1, \ldots, h_p t_p)$.
Then, the MPLE u is of the form

(2.2)  $u(x) = \lambda^{-1} \sum_{i=1}^{n} u(X_i)^{-1} (h_1 \cdots h_p)^{-1} k\!\left( \frac{x_1 - X_{i1}}{h_1}, \ldots, \frac{x_p - X_{ip}}{h_p} \right)$,

where k can be any real function on $R^p$ which
integrates to one and has $\hat k > 0$. The MPLE of the
density function is $f_n = u^2$.

n 

The flexibility in the choice of the penalty
functional in (2.1) allows a variety of kernels
k in (2.2), which can be chosen in ways that
allow for clearer definition of the "peaks" and
"valleys" of the density estimates. In
particular we can choose


$k(x) = \bigl(1 - c\,(x^{T}x - p)\bigr)\,\phi(x), \quad x \in \mathbb{R}^p,$

where $\phi$ denotes the p-variate standard normal
density, a kernel corresponding to c=0. For
c=1, essentially we subtract u'' from a u based
on $\phi$, resulting in a spline with improved
performance at the concave and convex parts of
the density surface. The value c=1/2 results
in a kernel with zero second moment and in an
estimate with enhanced rates of convergence.
Note that in the last two cases the MPLE u may
assume negative values over areas where the data
are very sparse. The density estimate $f_n$, however,
remains a proper probability density.


Under mild moment and smoothness assumptions on
the underlying density f, the MPLE's are
consistent, with optimal rates of convergence,
in a variety of senses, e.g., in the Hellinger
distance, uniform and Sobolev (corresponding
to H) norms. Analogous results can be
derived for the derivatives of $f_n$.


Note that once $u(X_i, Y_i)$, i=1,...,n have been
determined, it is straightforward to compute the
corresponding nonparametric regression estimator

$m_n(x) = \int y\, f_n(x,y)\,dy \Big/ \int f_n(x,y)\,dy;$

for details see Klonias (1984). When the kernel k in (2.2) is a
product of univariate kernels, these regression
estimates have the appealing property of reducing
to the classical nonparametric kernel regression
estimators when the smoothing parameters
corresponding to the Y's are allowed to go to zero.
For example, if $k(x,y) = k_1(x)k_2(y)$ the MPLE of
m(x) is given by

$m_n(x) = \Bigl\{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(x)\,Y_i\Bigr\} \Big/ \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(x),$

where the weights $w_{ij}(x)$ involve the factor
$(k_2 * k_2)\bigl((Y_i - Y_j)/h_2\bigr)$ and * denotes convolution.
Then, letting $h_2 \to 0$ we obtain the kernel regression estimate

$m_n(x) = \Bigl\{\sum_{i=1}^{n} Y_i\, k_1\bigl((x - X_i)/h_1\bigr)\Bigr\} \Big/ \Bigl\{\sum_{i=1}^{n} k_1\bigl((x - X_i)/h_1\bigr)\Bigr\}.$
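The classical kernel regression estimate that the MPLE-based estimator reduces to can be sketched in a few lines. This is only an illustration of the limiting formula: the Gaussian choice for $k_1$, the function names, and the small data set are assumptions made here, not taken from the paper.

```python
import math

def gaussian_kernel(t):
    """Standard normal density, used here as the univariate kernel k_1 (an assumed choice)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kernel_regression(x, xs, ys, h):
    """Classical kernel regression estimate at x: a kernel-weighted average of the Y's."""
    weights = [gaussian_kernel((x - xi) / h) for xi in xs]
    return sum(w * yi for w, yi in zip(weights, ys)) / sum(weights)

# Hypothetical illustration data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# At the centre of a symmetric design the weighted average recovers the line.
estimate_mid = kernel_regression(2.0, xs, ys, h=0.5)
print(estimate_mid)
```

The bandwidth h plays the role of $h_1$; smaller values make the estimate follow the nearest observations more closely.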


3.  NUMERICAL  EVALUATION  OF  THE  MPLE'S 


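Note that (2.2) defines the spline u implicitly
and we need to evaluate u at the sample points.
To this end we set $x = X_i$, i=1,...,n in (2.2)
and obtain the following system of nonlinear
equations:

(3.1)  $q_i^{-1} = \sum_{j=1}^{n} q_j\, k\bigl((X_{i1} - X_{j1})/h_1, \ldots, (X_{ip} - X_{jp})/h_p\bigr), \quad i = 1,\ldots,n,$

where $q_i = (\lambda h_1 \cdots h_p)^{-1/2}\, u(X_i)^{-1}$.
Note that the $q_i$'s do not depend on $\lambda$, which is
then determined by the equation $\int u^2 = 1$, i.e.,
for a symmetric kernel k,

$\lambda = \sum_{i=1}^{n}\sum_{j=1}^{n} q_i q_j\, (k * k)\bigl((X_{i1} - X_{j1})/h_1, \ldots, (X_{ip} - X_{jp})/h_p\bigr).$

Note that the $q = (q_1,\ldots,q_n)^{T}$ which solves
(3.1) is the unique solution to the following
optimization problem:

(3.2)  $\min\Bigl\{\tfrac{1}{2}\, q^{T}\Sigma\, q - \sum_{i=1}^{n}\log q_i \;;\; q \in \mathbb{R}^n\Bigr\}$

subject to $q_i > 0$, i=1,...,n,
where the (i,j)th entry of the positive definite
matrix $\Sigma$ is given by

$\sigma_{ij} = k\bigl((X_{i1} - X_{j1})/h_1, \ldots, (X_{ip} - X_{jp})/h_p\bigr).$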


The algorithm we use to solve (3.2) is based on
a truncated-Newton method, described in Nash
(1982); for details see Klonias and Nash (1983).
Note that the dimension of the data influences
only the computation of $\Sigma$, so that the numerical
effort of solving (3.2) does not increase
significantly with the dimension of the random
variable. For n=200 and p=2, the solution of
(3.2) requires CPU time on a VAX 11/780 of the
order of 40 seconds.
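As a rough illustration of the numerical task, the sketch below solves a small instance of the system in Section 3 under the assumption that it has the form $q_i^{-1} = \sum_j \sigma_{ij} q_j$, i.e., the stationarity condition of $\tfrac{1}{2} q^{T}\Sigma q - \sum_i \log q_i$. It uses plain coordinate descent rather than the truncated-Newton method cited above, and the sample points, bandwidth, and function names are all hypothetical.

```python
import math

def normal_kernel(t):
    """Univariate standard normal density (assumed kernel for this sketch)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kernel_matrix(points, h):
    """Matrix with entries k((X_i - X_j)/h) for one-dimensional points."""
    return [[normal_kernel((a - b) / h) for b in points] for a in points]

def solve_q(sigma, sweeps=500):
    """Coordinate descent on 0.5*q'Sq - sum(log q_i), keeping every q_i positive."""
    n = len(sigma)
    q = [1.0] * n
    for _ in range(sweeps):
        for i in range(n):
            a = sigma[i][i]
            b = sum(sigma[i][j] * q[j] for j in range(n) if j != i)
            # Minimising over q_i alone gives a*q_i^2 + b*q_i - 1 = 0; take the positive root.
            q[i] = (-b + math.sqrt(b * b + 4.0 * a)) / (2.0 * a)
    return q

def max_residual(sigma, q):
    """Largest violation of the assumed system sum_j sigma_ij q_j = 1/q_i."""
    n = len(q)
    return max(abs(sum(sigma[i][j] * q[j] for j in range(n)) - 1.0 / q[i])
               for i in range(n))

sample = [-1.2, -0.4, 0.1, 0.5, 1.3]      # hypothetical observations
sigma = kernel_matrix(sample, h=0.5)      # hypothetical bandwidth
q = solve_q(sigma)
print(max_residual(sigma, q))
```

Coordinate descent is chosen only because each one-dimensional subproblem has a closed-form positive root; for n in the hundreds a Newton-type method would be the more natural choice.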


For the data-based choice of the smoothing
parameters we propose to max{$\lambda(h_1,\ldots,h_p)$;
$h_1,\ldots,h_p > 0$}; for details see Klonias (1984).
In the graphs that follow, the smoothing parameters, when
estimated from the data, were chosen as
the minimizers of a criterion defined in terms of $u_\lambda$ and $u_n$,
where $u_\lambda$ denotes the MPLE and $u_n$ denotes the
solution to problem (2.1) with $\lambda = n$, the asymptotic
value of $\lambda$. The CPU time required for the
numerical evaluation of the MPLE with data-based
smoothing parameters, n=200, is of the order of 80 seconds.

The data for the graphs that follow were
generated using the IMSL routine GGNML with
DSEED's 255866175 and 1949292845.

FIGURE 1. n=100; based on kernel (3.3).


FIGURE 4. n=200; based on the standard normal kernel.

n=200; data-based choice of the smoothing parameters;
based on kernel (3.3).

FIGURE 5. The N(0,0;1,1;V) surface.


Figures 2, 5 and 8 are the underlying surfaces
which we estimate by the surfaces in Figures 1,
3, 4 and in Figures 6, 7 and in Figure 9
respectively. In Figures 3, 6, 9 the smoothing
parameters have been estimated from the data,
as described earlier. The estimates in Figures
1, 3, 6, 9 are based on the following kernel:
This paper reviews current methods of categorical data analysis, and illustrates how
SAS®  software  can  be  used  to  perform  the  analyses.  Topics  include:  randomization 
methods  for  testing  hypotheses  under  a  minimum  of  assumptions,  linear  and  log- 
linear  modeling  of  categorical  responses,  weighted-least-squares  estimation  methods 
for  investigating  the  variation  among  functions  of  proportions,  maximum-likelihood 
estimation  using  Newton-Raphson  and  iterative  proportional  fitting,  repeated 
measures  analysis,  stratified  analysis,  logistic  regression,  and  the  analysis  of  data 
from  complex  sample  surveys.  Examples  of  each  type  of  analysis  are  given. 


1.  INTRODUCTION 

The  capabilities  of  SAS  software  for  categorical 
data  analysis  have  increased  dramatically  over 
the  past  few  years.  The  capabilities  discussed  in 
this  paper  are  available  in  Version  5  of  the 
software,  scheduled  for  release  in  the  middle  of 
1985.  The  primary  procedures  for  categorical 
data  analysis  are 

•  CATMOD procedure (replaces FUNCAT)

•  FREQ procedure (replaces TFREQ)

•  IML procedure (replaces MATRIX).

The  CATMOD  procedure  does  general  linear 
modeling  of  categorical  data,  including  linear 
models,  log-linear  models,  logistic  regression, 
and  repeated  measures  analysis.  The  FREQ 
procedure  does  analysis  of  association  and 
stratified  analysis.  The  IML  procedure 
encompasses  an  interactive  matrix  language  that 
makes  it  relatively  easy  to  program  any 
customized  analysis  that  is  desired. 

The  remaining  sections  of  this  paper  are  divided 
as  follows. 

2.  Two-Way  Contingency  Tables 

3.  Stratified  Analysis 

3.1  Partial  Association  Testing 

3.2  Estimation  of  Relative  Risk 

4.  General  Linear  Model  Analysis 

5.  Log-Linear  Models,  Maximum  Likelihood 

6.  Models  for  Ordinal  Data 

7.  Repeated  Measures  Analysis 

8.  Complex  Sample  Survey  Data  Analysis 

For  conservation  of  space,  the  printed  output 
displayed  in  this  paper  for  any  given  problem  is 
generally  only  a  small  portion  of  the  amount 
produced  by  the  procedure. 


2.  TWO-WAY  CONTINGENCY  TABLES 

For  two-way  contingency  tables,  PROC  FREQ 
does  an  analysis  of  association  that  is  divided 
into  two  or  more  parts.  The  first  part  contains 
test  statistics  and  p-values  for  testing  the  null 
hypothesis  of  no  association  between  the  two 
variables.  The  second  part  contains  measures  of 
association  for  estimating  the  strength  of  any 
association  that  might  be  present. 

In  choosing  measures  of  association  to  use  in 
analyzing  a  two-way  table,  one  should  consider 
the  study  design,  the  measurement  scale  of  the 
variables,  the  type  of  association  that  each 
measure  is  designed  to  detect,  and  any 
assumptions  required  for  valid  interpretation  of  a 
measure.  For  further  information  on  choosing 
measures  of  association  for  a  specific  set  of  data, 
see  Goodman  and  Kruskal  (1979),  or  Bishop, 
Fienberg,  and  Holland  (1975,  Chapter  11). 

Similar  comments  apply  to  the  choice  and 
interpretation  of  the  test  statistics.  For  example, 
the Mantel-Haenszel chi-square statistic requires
an  ordinal  scale  for  both  variables,  and  is 
designed  to  detect  a  linear  association.  The 
Pearson  chi-square,  on  the  other  hand,  is 
appropriate  for  all  variables,  and  can  detect  any 
kind  of  association,  but  it  is  less  powerful  for 
detecting  a  linear  association  because  its  power  is 
dispersed  over  a  greater  number  of  degrees  of 
freedom  (except  for  2  by  2  tables). 

For  2  by  2  tables,  PROC  FREQ  also  computes 
estimates  of  relative  risk  and  their  corresponding 
confidence  intervals.  For  two  dichotomous 
variables,  D  and  E,  the  relative  risk  of  D  is 
defined  as 

RR = Prob(D=yes | E=yes) / Prob(D=yes | E=no).

Because  the  definition  of  relative  risk  involves 
conditional  probabilities,  the  estimation 
procedure  depends  on  which  variable,  if  either, 
was  fixed  by  the  study  design.  The  FREQ 
procedure  therefore  gives  different  estimates  for 
different  study  designs. 




•  For  case-control  studies  (D  fixed,  E 
random),  the  estimator  of  the  common 
relative  risk  is  the  common  odds  ratio. 

•  For  cohort  studies  (E  fixed,  D  random)  and 
for  cross-sectional  studies  (D  and  E  both 
random),  there  is  a  direct  estimator  of  the 
common  relative  risk. 

See the SAS User's Guide: Statistics (1985) for
computational  formulas  and  references  for  all  of 
the  test  statistics  and  measures  of  association. 

Example 

The  following  control  statements  read  some 
hypothetical  data  and  request  an  analysis  of 
association  from  PROC  FREQ. 

DATA;
   INPUT FACTOR $ DISEASE $ COUNT;
   CARDS;
YES YES 19
YES NO  53
NO  YES 13
NO  NO  65
;
PROC FREQ ORDER=DATA;
   WEIGHT COUNT;
   TABLE FACTOR*DISEASE / ALL;

Figure 1 displays the contingency table printed
by PROC FREQ, and Figure 2 shows the
corresponding statistics. The statistics indicate a
nonsignificant (α = .10) association, with a
relatively small correlation coefficient (.12). The
relative-risk estimates suggest that those who are
exposed to the factor of interest are at least one
and a half times more likely to get the disease
than those who are not exposed to the factor.
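Several of the quantities quoted above can be checked by hand from the four cell counts. The sketch below is not PROC FREQ; it applies the standard 2 by 2 formulas for the Pearson chi-square, the phi coefficient, the odds ratio (the case-control estimate), and the direct column-1 relative risk (the cohort estimate). The function and key names are chosen here for illustration.

```python
import math

# Cell counts from the example: rows are FACTOR (YES, NO), columns are DISEASE (YES, NO).
table = [[19, 53], [13, 65]]

def two_by_two_summary(t):
    """Standard 2 by 2 summaries computed from the cell counts."""
    (a, b), (c, d) = t
    n = a + b + c + d
    r1, r2 = a + b, c + d          # row totals
    c1, c2 = a + c, b + d          # column totals
    margins = r1 * r2 * c1 * c2
    return {
        "chi_square": n * (a * d - b * c) ** 2 / margins,
        "phi": (a * d - b * c) / math.sqrt(margins),
        "odds_ratio": (a * d) / (b * c),
        "col1_relative_risk": (a / r1) / (c / r2),
    }

summary = two_by_two_summary(table)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The chi-square and phi values agree with the 2.109 and 0.119 reported in Figure 2, and both risk estimates exceed 1.5, which is the basis for the "at least one and a half times" reading.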


Figure 2

STATISTICS FOR TABLE OF FACTOR BY DISEASE

STATISTIC                       DF    VALUE    PROB
CHI-SQUARE                       1    2.109    0.146
LIKELIHOOD RATIO CHI-SQUARE      1    2.114    0.146
CONTINUITY ADJ. CHI-SQUARE       1    1.569    0.210
MANTEL-HAENSZEL CHI-SQUARE       1    2.095    0.148
FISHER'S EXACT TEST (1-TAIL)                   0.105
                    (2-TAIL)                   0.166
PHI                                   0.119
CONTINGENCY COEFFICIENT               0.118
CRAMER'S V                            0.119

STATISTIC                       VALUE    ASE
GAMMA                           0.284    0.166
KENDALL'S TAU-B                 0.119    0.081
STUART'S TAU-C                  0.097    0.067
SOMERS' D C|R                   0.097    0.067
SOMERS' D R|C                   0.145    0.098
PEARSON CORRELATION             0.119    0.081
SPEARMAN CORRELATION            0.119    0.061
LAMBDA ASYMMETRIC C|R           0.000    0.000
LAMBDA ASYMMETRIC R|C           0.083    0.075
LAMBDA SYMMETRIC                0.058    0.052
UNCERTAINTY COEFFICIENT R|C     0.014

ESTIMATES OF THE RELATIVE RISK (ROW1/ROW2)

TYPE OF STUDY          VALUE    95% CONFIDENCE BOUNDS
COHORT (COL1 RISK)
COHORT (COL2 RISK)


Figure 1

TABLE OF FACTOR BY DISEASE

Cell contents: FREQUENCY, PERCENT, ROW PCT, COL PCT


3.  STRATIFIED  ANALYSIS 

The  FREQ  procedure  provides  an  analysis  of  the 
relationship  between  two  variables,  after 
adjusting  for  the  effect  of  potential  confounding 
variables.  Stratified  analysis  is  similar  to  the 
process  of  fitting  a  regression  model  that  relates 
some  function  of  the  dependent  variable  to  a 
linear  combination  of  the  independent  variable 
and  the  confounding  variables.  The  advantage  of 
stratified  analysis  over  regression  is  twofold:  (1) 
you  can  adjust  for  the  effect  of  the  confounding 
variables  without  being  forced  to  estimate 
parameters  for  them,  and  (2)  you  can  get  a  much 
clearer  picture  of  the  patterns  of  interaction  and 
the  sources  of  variation  since  you  can  look  at 
statistics  from  the  individual  strata. 

For  specifying  a  stratified  analysis  of  the 
relationship  between  variables  C  and  D,  after 
adjusting  for  variables  A  and  B,  the  required 
statements  are 


PROC FREQ;
   TABLES A*B*C*D / ALL;


On  the  basis  of  these  statements,  one  stratum  is 
formed  for  each  combination  of  the  levels  of 
variables  A  and  B.  For  each  stratum,  a 
contingency  table  of  C  by  D  is  printed,  together 
with  test  statistics  and  measures  of  association. 
Lastly,  the  FREQ  procedure  prints  the  statistics 
that  summarize  the  information  across  the  strata 
in  an  efficient  way.  The  following  sections 
pertain  to  these  summary  statistics. 


3.1  Partial  Association  Testing 

The  class  of  generalized  Cochran-Mantel-Haenszel 
(CMH)  statistics  (Landis,  Heyman,  and  Koch 
1978)  is  an  important  class  of  statistics  for 
testing  no  partial  association  in  a  stratified 
analysis.  They  have  several  major  advantages. 

•  The  assumptions  required  for  their  validity 
are  minimal.  They  do  not  require  a  linear 
model,  nor  do  they  assume  any  parametric 
form  for  the  observed  data.  They  require 
only  fixed  row  and  column  margins  for  the 
contingency  table  in  each  stratum,  and 
these  fixed  margins  can  be  obtained  by 
design  or  by  conditional  distribution 
arguments.

•  They  do  not  require  a  large  sample  size 
within  each  stratum.  They  have  a  chi- 
square  distribution  when  the  null 
hypothesis  of  no  partial  association  is  true 
and  when  the  effective  overall  sample  size  is 
large. 

•  The  statistics  depend  on  scores  for  the  row 
and  column  variables.  The  scores  give 
flexibility  with  respect  to  the  alternative 
hypothesis  being  tested,  and  they  allow  the 
choice  of  parametric  or  nonparametric 
analyses.


CMH  statistics  have  low  power  for  detecting  an 
association  in  which  the  patterns  of  association 
for  some  of  the  strata  are  in  the  opposite 
direction  of  the  patterns  displayed  by  other 
strata.  Thus,  a  nonsignificant  CMH  statistic 
suggests  either  that  there  is  no  association,  or 
that  no  pattern  of  association  had  enough 
strength  or  consistency  to  dominate  any  other 
pattern.

The formulas for the CMH statistics are given in
the SAS User's Guide: Statistics (1985). For
additional information on the development of CMH
statistics, see Cochran (1954), Mantel and
Haenszel (1959), Mantel (1963), Birch (1965),
and Landis, Heyman, and Koch (1978).

The  FREQ  procedure  computes  the  following 
types  of  CMH  statistics,  reflecting  different 
alternative  hypotheses. 

The  correlation  statistic  (df=1) 


The correlation statistic, with one degree of
freedom, is also known as the Mantel-Haenszel
statistic. This statistic requires that both the
row and column variables be ordinally scaled,
and the alternative hypothesis is that there is a
linear association in at least one stratum. When
there is only one stratum, the Mantel-Haenszel
statistic reduces to (N-1)r², where r is a
correlation coefficient (either Pearson or
Spearman, depending on whether the scores are
parametric or nonparametric).
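The single-stratum identity can be checked on the FACTOR by DISEASE table from Section 2. The sketch below computes a count-weighted Pearson correlation between row and column scores and multiplies its square by N-1; the integer scores 1 and 2 are an assumption made for the illustration.

```python
import math

def weighted_correlation(table, row_scores, col_scores):
    """Pearson correlation of row and column scores, weighted by the cell counts."""
    cells = [(row_scores[i], col_scores[j], table[i][j])
             for i in range(len(table)) for j in range(len(table[0]))]
    n = sum(w for _, _, w in cells)
    mean_x = sum(x * w for x, _, w in cells) / n
    mean_y = sum(y * w for _, y, w in cells) / n
    sxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in cells)
    sxx = sum(w * (x - mean_x) ** 2 for x, _, w in cells)
    syy = sum(w * (y - mean_y) ** 2 for _, y, w in cells)
    return sxy / math.sqrt(sxx * syy), n

table = [[19, 53], [13, 65]]   # counts from the Section 2 example
r, n = weighted_correlation(table, row_scores=[1, 2], col_scores=[1, 2])
mh_statistic = (n - 1) * r ** 2
print(round(r, 3), round(mh_statistic, 3))
```

The result matches the Pearson correlation (0.119) and the Mantel-Haenszel chi-square (2.095) printed in Figure 2.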


The  ANOVA  statistic  (df=R-1) 


This  statistic  requires  that  the  column  variable 
lie  on  an  ordinal  (or  interval)  scale.  The  mean 
column  score  is  computed  for  each  row  of  the 
table,  and  the  alternative  hypothesis  is  that,  for 
at  least  one  stratum,  the  mean  scores  of  the  R 
rows  are  unequal.  In  other  words,  the  statistic 
is  sensitive  to  location  differences  among  the  R 
distributions  of  the  column  variable. 

When  there  is  only  one  stratum,  this  CMH 
statistic is essentially an analysis-of-variance
(ANOVA)  statistic  in  the  sense  that  it  is  a 
function  of  the  variance  ratio  F  statistic.  If 
nonparametric  scores  are  specified  in  this  case, 
then  the  ANOVA  statistic  is  identical  to  a 
Kruskal-Wallis  test. 

If  there  is  more  than  one  stratum,  then  the  CMH 
statistic  corresponds  to  a  stratum-adjusted 
ANOVA  or  Kruskal-Wallis  test.  In  the  special 
case  where  there  is  one  subject  per  row  and  one 
subject  per  column  in  the  contingency  table  of 
each  stratum,  this  CMH  statistic  is  identical  to 
Friedman's  chi-square. 


The  general  association  statistic  (df=(R-1)(C-1)) 


This  statistic  is  always  interpretable  because  it 
does  not  require  an  ordinal  scale  for  either 
variable.  The  alternative  hypothesis  is  that,  for 
at  least  one  stratum,  there  is  some  kind  of 
association. When there is only one stratum, then
the general association CMH statistic reduces to
((N-1)/N)Q_P, where Q_P is the Pearson chi-square
statistic.
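For a single 2 by 2 table this reduction is easy to verify numerically. The sketch below computes the statistic from the hypergeometric mean and variance of the first cell (fixed margins, as the CMH approach assumes) and compares it with ((N-1)/N) times the Pearson chi-square. It covers only the 2 by 2 case, and the function names are illustrative.

```python
def general_association_2x2(table):
    """CMH-type statistic for one 2 by 2 table, using the hypergeometric variance of cell (1,1)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    expected = r1 * c1 / n
    variance = r1 * r2 * c1 * c2 / (n * n * (n - 1))
    return (a - expected) ** 2 / variance

def pearson_chi_square_2x2(table):
    """Pearson chi-square for one 2 by 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

table = [[19, 53], [13, 65]]   # counts from the Section 2 example
n_total = sum(sum(row) for row in table)
cmh_value = general_association_2x2(table)
scaled_pearson = (n_total - 1) / n_total * pearson_chi_square_2x2(table)
print(cmh_value, scaled_pearson)
```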


Example 


As an example of partial association testing, we
consider data from a study of the treatment of
duodenal ulcer (Grizzle, Starmer, and Koch
1969). Specifically, interest lies in the question
of whether there is an association between
treatment and the severity of an undesirable
complication of treatment called dumping
syndrome. As indicated in Figure 3, severity is
ordinally scaled (none, slight, moderate), and
treatment is also ordinally scaled since the
treatments correspond to the percentage of the
stomach removed during a surgical operation.
The hospital at which surgery was done
represents a potential confounding variable which
needs to be controlled in the analysis.


Figure 4

ANALYSIS OF DUMPING SYNDROME DATA


Figure  3  shows  the  control  statements  required  to 
do  both  a  parametric  and  a  nonparametric 
stratified  analysis.  As  shown  in  Figure  4,  the 
general-association  and  the  analysis-of-variance 
CMH statistics are nonsignificant (α = .05), but
the  correlation  statistics  are  significant  (p<.02). 
This  indicates  that  there  is  a  linear  association  in 
at  least  one  of  the  strata,  and  it  illustrates  the 
value  of  having  statistics  that  have  their  power 
concentrated  on  narrowly  defined  alternative 
hypotheses. 

Figure  5  displays  the  correlation  results  from  the 
individual  strata.  The  source  of  correlation  and 
the  pattern  of  interaction  is  very  clear:  the 
linear  association  between  treatment  and  severity 
arises  only  from  hospital  2. 


Figure 3

DUMPING SYNDROME DATA

INDEPENDENT VARIABLES

1. TREATMENT (OPERATION)
   A. DRAINAGE AND VAGOTOMY
   B. 25% RESECTION AND VAGOTOMY
   C. 50% RESECTION AND VAGOTOMY
   D. 75% RESECTION
2. HOSPITAL (1, 2, 3, 4)

DEPENDENT VARIABLE

SEVERITY OF DUMPING SYNDROME (NONE, SLIGHT, MODERATE)

REFERENCE: GRIZZLE, ET AL. (1969), BIOMETRICS 25, 489-504.

............

PROC FREQ ORDER=DATA;
   WEIGHT WT;
   TABLES HOSPITAL*TRT*SEVERITY / ALL;
   TABLES HOSPITAL*TRT*SEVERITY / ALL SCORES=RANK;
   TITLE 'ANALYSIS OF DUMPING SYNDROME DATA';


SUMMARY STATISTICS FOR TRT BY SEVERITY
CONTROLLING FOR HOSPITAL
Figure 5

Correlation Analysis by Stratum

            Sample    Pearson        Mantel-Haenszel
Hospital    Size      Correlation    Chi-Square         DF    Prob
   1         148        0.10            1.57             1     .21
   2         105        0.26            7.06             1     .01
   3          74        0.05            0.16             1     .69
   4          90        0.09            0.66             1     .42


3.2  Estimation  of  Relative  Risk 

As in the case of a single two-way contingency
table, the estimate of relative risk depends on
the study design, and thus PROC FREQ gives
separate estimates for the different designs.
Also, it uses two different methods to obtain the
estimate and its corresponding confidence
interval.

•  Mantel-Haenszel  estimate,  with  a  test-based 
confidence  interval 

•  Logit  estimate,  with  a  precision-based 
confidence  interval 


A  major  advantage  of  the  Mantel-Haenszel  (MH) 
estimator  over  the  logit  estimator  (Woolf  1955, 
Haldane  1955)  is  that  cell  frequencies  of  zero 
pose  no  computational  problem  for  the  MH 
estimator.  Thus,  there  is  no  need  to  add  1/2  to 
certain  cell  frequencies,  as  is  sometimes 
necessary  with  the  logit  estimator  and  its 
corresponding  confidence  interval. 
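The contrast between the two estimators can be illustrated for the common odds ratio. The sketch below uses the classical Mantel-Haenszel formula and a Woolf-type weighted average of stratum log odds ratios; it is not the PROC FREQ implementation, the two strata are hypothetical, and the rule of adding 1/2 to every cell of a stratum that contains a zero is one common convention assumed here.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """MH estimate: sum(a*d/n) / sum(b*c/n) over strata given as (a, b, c, d)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

def logit_odds_ratio(strata, add_half=False):
    """Woolf-type estimate: precision-weighted average of stratum log odds ratios."""
    weighted_sum = 0.0
    weight_total = 0.0
    for cells in strata:
        if add_half and 0 in cells:
            cells = tuple(x + 0.5 for x in cells)   # assumed zero-cell correction
        a, b, c, d = cells
        log_or = math.log(a * d / (b * c))
        weight = 1.0 / (1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
        weighted_sum += weight * log_or
        weight_total += weight
    return math.exp(weighted_sum / weight_total)

# Hypothetical strata; the second one has a zero cell.
strata = [(10, 20, 5, 25), (8, 12, 0, 20)]

mh_estimate = mantel_haenszel_odds_ratio(strata)          # no special handling needed
corrected_logit = logit_odds_ratio(strata, add_half=True)  # needs the 1/2 correction
print(mh_estimate, corrected_logit)
```

Calling `logit_odds_ratio(strata)` without the correction fails with a division by zero, which is the computational problem the MH estimator avoids.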


The test-based confidence interval has some
theoretical problems because it is based on the
assumption that the Cochran-Mantel-Haenszel test
statistic has a chi-square distribution, which is
true only when the null hypothesis of no partial
association is true. However, from a practical
point of view, the bias seems to be very small
when the parameter of interest does not differ
greatly from 1 (say, for 1/4 < RR < 4).

The formulas for the estimators are given in the
SAS User's Guide: Statistics (1985). For
additional  information  on  stratified  analysis, 
relative  risk  estimation,  and  confidence  interval 
estimation,  see  Kleinbaum,  Kupper,  and 
Morgenstern  (1982). 


Example 

These  data  are  from  a  detergent  preference 
study  (Cox  1970).  See  Figure  6  for  a  description 
of  the  dependent  and  independent  variables,  and 
Figure  7  for  a  listing  of  the  data  and  the  control 
statements  required  to  do  a  stratified  analysis 
with  PROC  FREQ.  The  question  of  interest  for 
this  example  is  the  following.  Is  there  an 
association  between  preferred  brand  of  laundry 
detergent  and  previous  usage  of  Brand  M,  after 
controlling  for  the  softness  and  the  temperature 
of  the  laundry  water,  and  if  so,  what  is  the 
magnitude  of  the  relationship? 


Figure 8 displays the contingency table for
stratum 1, and Figure 9 shows the page of
summary statistics from the printed output. The
CMH statistic is highly significant, indicating
very strong evidence of a partial association
between preferred brand and previous usage of
Brand M. This study was a cross-sectional
study, and the contingency tables are set up
with PREF=M in the first column of each table.
Thus, we refer to the COL1 RISK section of the
output for estimation of relative risk. The results
indicate that, on the average, previous users of
Brand M laundry detergent are about 1.3
(= 1/.75) times more likely to prefer Brand M than
those who are not previous users.


Figure 6

DETERGENT PREFERENCE STUDY

DEPENDENT VARIABLE

BRAND    = BRAND PREFERRED

INDEPENDENT VARIABLES

SOFTNESS = SOFTNESS OF WATER             SOFT, MED, HARD
PREV     = PREVIOUS USER OF BRAND M?     YES, NO
TEMP     = TEMP OF LAUNDRY WATER         HIGH, LOW

FROM: RIES AND SMITH, CHEMICAL ENGINEERING
      PROGRESS 59 (1963), PP. 39-43.
      COX (1970) THE ANALYSIS OF BINARY DATA, P. 38


Figure 7

TITLE 'DETERGENT PREFERENCE STUDY';
DATA DETERG;
   INPUT SOFTNESS $ BRAND $ PREV $ TEMP $ COUNT @@;
   CARDS;
PROC FREQ;
   WEIGHT COUNT;
   TABLE SOFTNESS*TEMP*PREV*BRAND / ALL;


Figure 10 shows a relative-risk analysis by
stratum. The results indicate a fair amount of
interaction, with strata 1, 2, and 3 having
similar estimates (.65, .65, .61), with strata 4
and 5 displaying a weaker association (.80, .80),
and with stratum 6 showing no association (.99).
Given the large sample sizes within each stratum,
one could use the CATMOD procedure to do
modeling of the relative risk estimates


Figure 10

Detergent Preference Study
Relative Risk Analysis by Stratum


4. GENERAL LINEAR MODEL ANALYSIS

The CATMOD procedure fits linear models to
general functions of categorical data. It does so
by facilitating transformations of an initial
proportion vector (p) to a function vector (F),
and by estimating the parameters of the linear
model F(π) = Xβ, where π is the vector of
underlying probabilities. CATMOD uses one of
two estimation methods:

•  weighted-least-squares  estimation,  available 
for  all  types  of  response  functions 

•  maximum-likelihood  estimation,  available  for 
logistic  regression  and  log-linear  models. 

Both  methods  of  estimation  are  BAN  (best 
asymptotic  normal),  and  therefore  they  are 
asymptotically  equivalent.  After  the  parameters 
are  estimated,  CATMOD  computes  a  goodness-of- 
fit  test,  as  well  as  Wald  statistics  for  testing 
model  effects  (such  as  main  effects  and 
interactions) and other null hypotheses of
interest.

The theory for the weighted-least-squares
estimation and the general linear modeling may be
found in Grizzle, Starmer, and Koch (1969). The
theory for the maximum-likelihood estimation and
the log-linear modeling is in Fienberg (1980) and
Bishop, Fienberg, and Holland (1975). The
computational formulas used by CATMOD can be
found in the SAS User's Guide: Statistics (1985).
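A minimal sketch of the weighted-least-squares idea, for the simplest case of one binomial proportion per population and a straight-line model in a single covariate, may help fix the notation. It is not CATMOD's algorithm in general: the closed-form two-parameter solution, the function name, and the data are assumptions made for illustration. The weights are the reciprocals of the estimated variances p(1-p)/n, and the residual sum is a Wald-type goodness-of-fit statistic.

```python
def wls_linear_fit(successes, totals, x_values):
    """Fit p_i = intercept + slope * x_i by weighted least squares with weights n_i / (p_i (1 - p_i))."""
    props = [y / n for y, n in zip(successes, totals)]
    weights = [n / (p * (1.0 - p)) for p, n in zip(props, totals)]
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, x_values))
    swxx = sum(w * x * x for w, x in zip(weights, x_values))
    swp = sum(w * p for w, p in zip(weights, props))
    swxp = sum(w * x * p for w, x, p in zip(weights, x_values, props))
    det = sw * swxx - swx * swx
    intercept = (swxx * swp - swx * swxp) / det
    slope = (sw * swxp - swx * swp) / det
    # Weighted residual sum of squares: goodness-of-fit statistic with s - 2 degrees of freedom.
    residual_q = sum(w * (p - intercept - slope * x) ** 2
                     for w, p, x in zip(weights, props, x_values))
    return intercept, slope, residual_q

# Hypothetical data: four populations of 100 subjects each.
intercept, slope, residual_q = wls_linear_fit(
    successes=[22, 29, 41, 50], totals=[100, 100, 100, 100], x_values=[0, 1, 2, 3])
print(intercept, slope, residual_q)
```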

One  can  analyze  almost  any  functions  of  the 
original  proportions,  including  logits,  marginal 
probabilities,  marginal  logits,  means,  cumulative 
probabilities, cumulative logits, survival
probabilities,  kappa  statistics,  odds  ratios,  risk 
ratios,  etc.  Some  of  the  most  common  analyses 
use  linear  response  functions  (for  linear  models) 
or  logit  response  functions  (for  logistic 
regression  and  log-linear  models).  The  two 
examples  in  this  section  illustrate  a  linear  model 
and  a  logistic  regression.  Log-linear  models  and 
repeated  measures  analysis  are  dealt  with  in 
separate  sections. 


Example 


The  first  example  is  a  linear  model  analysis  of  the 
detergent  preference  data  used  in  Section  3.  The 
control  statements  required  to  fit  a  main-effects 
model  are 

PROC CATMOD;
   RESPONSE 1 0;
   WEIGHT COUNT;
   MODEL BRAND = SOFTNESS PREV TEMP;
   TITLE2 'LINEAR MAIN-EFFECTS MODEL';

Figure 11 shows part of the output printed by
CATMOD. The design matrix X contains columns
corresponding to the main effects in the model
statement. The analysis-of-variance table shows
that the model fits the data adequately (Q=8.26,
df=7), and that the PREV and TEMP main effects
are statistically significant (α = .05). The analysis
of individual parameters gives the parameter
estimates and their standard errors. The
estimated covariance matrix and the correlation
matrix of the estimated parameters are also
computed upon request.


Figure  11 
Example


The  second  example  is  a  logistic  regression 
analysis  of  the  same  data.  The  response 
functions  to  be  analyzed  are  the  logits,  but  the 
required  control  statements  do  not  include  a 
response  statement  since  logits  are  the  default 
response  functions: 

PROC CATMOD;
   WEIGHT COUNT;
   MODEL BRAND = SOFTNESS PREV TEMP
         / NOPROFILE NODESIGN NOPARM ML;
   TITLE2 'LOGIT MAIN-EFFECTS MODEL';

The ML specification in the MODEL statement
requests maximum-likelihood estimation of the
parameters.

Figure 12 shows the maximum-likelihood analysis
of the data. The initial estimates (iteration 0) of
the parameters are the weighted-least-squares
estimates, and subsequent estimates are printed
for each Newton-Raphson iteration until
convergence is achieved. The goodness-of-fit
test in the analysis-of-variance table is the
likelihood-ratio test, and it shows that the model
fits the data (Q=8.23, df=7). With respect to the
significance of the main effects in the model, the
Wald statistics based on the maximum-likelihood
estimates for the logit model are very similar to
those based on weighted-least-squares estimates
for the linear model. Predicted cell frequencies
are also computed by CATMOD, if requested.


Figure  12 
5. LOG-LINEAR MODELS, MAXIMUM LIKELIHOOD

General log-linear modeling, with hierarchical or
nonhierarchical models, can be done by the
CATMOD procedure. Both weighted-least-squares
and maximum-likelihood (ML) estimation are
available. CATMOD uses Newton-Raphson
iteration to obtain its maximum-likelihood
estimates. If one has a large hierarchical model,
then iterative proportional fitting (IPF) is a more
efficient method of ML estimation, and the IPF
function in the IML procedure can be used for
this purpose.
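To show what iterative proportional fitting does, here is a minimal sketch that repeatedly rescales fitted cell counts to match the observed margins of a hierarchical model. It is not the IPF function of PROC IML. The counts are the eight drug-response frequencies listed in the example of this section, and the fitted margins (DRUGA by DRUGB, and DRUGC) correspond to a model with the three main effects and the DRUGA*DRUGB interaction; the function and variable names are assumptions.

```python
def ipf(observed, margins, cycles=20):
    """Iterative proportional fitting.

    observed: dict mapping cell tuples to counts.
    margins:  list of tuples of axis indexes whose marginal totals are to be matched.
    """
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(cycles):
        for axes in margins:
            observed_totals, fitted_totals = {}, {}
            for cell, count in observed.items():
                key = tuple(cell[a] for a in axes)
                observed_totals[key] = observed_totals.get(key, 0.0) + count
                fitted_totals[key] = fitted_totals.get(key, 0.0) + fitted[cell]
            # Rescale every fitted cell so this margin matches the observed one.
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                fitted[cell] *= observed_totals[key] / fitted_totals[key]
    return fitted

# Responses to drugs A, B, C (F = favorable, U = unfavorable) and their counts.
counts = {
    ("F", "F", "F"): 6, ("F", "F", "U"): 16, ("F", "U", "F"): 2, ("F", "U", "U"): 4,
    ("U", "F", "F"): 2, ("U", "F", "U"): 4, ("U", "U", "F"): 6, ("U", "U", "U"): 6,
}
fitted = ipf(counts, margins=[(0, 1), (2,)])
for cell in sorted(fitted):
    print(cell, round(fitted[cell], 3))
```

For this particular model the fitted values have a closed form (AB margin times C margin divided by N), so the iteration settles after one cycle; models without closed-form fits need several cycles.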

The basic log-linear model for one population may
be expressed as

$\pi = \exp(X\beta) \,/\, \mathbf{1}'\exp(X\beta)$

where $\pi$ is the vector of multinomial probabilities
for the population. Because of the restriction
that the probabilities add to one, an equivalent
way of expressing the model is

$F(\pi) = C \log(\pi) = CX\beta = X^{*}\beta$

where $X^{*} = CX$ and $C = [\,I,\; -\mathbf{1}\,]$ contrasts each
log probability with the last one.

But $F(\pi)$ is simply the vector of generalized (or
multiple) logits for the population probabilities.
Thus, the latter equations show that fitting a
log-linear model on the probabilities is equivalent
to fitting a linear model on the generalized logits.
Such a transformation brings log-linear modeling
into a general linear modeling framework, so that
the power and flexibility of a program such as
CATMOD can be brought to bear on log-linear
models.
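The equivalence can be checked numerically on a toy example. In the sketch below the design matrix and parameter values are hypothetical, and the generalized logits are taken relative to the last response category, which is an assumption about the reference level. The probabilities are built from exp(Xβ) normalized to sum to one, and their generalized logits are compared with the corresponding row differences of X applied to β.

```python
import math

def multinomial_probs(design, beta):
    """Probabilities proportional to exp(x_i' beta), normalized to sum to one."""
    linear = [sum(x * b for x, b in zip(row, beta)) for row in design]
    exponentials = [math.exp(v) for v in linear]
    total = sum(exponentials)
    return [e / total for e in exponentials]

def generalized_logits(probs):
    """log(pi_i / pi_last) for every category except the last."""
    return [math.log(p / probs[-1]) for p in probs[:-1]]

# Hypothetical three-category design and parameters.
design = [[1, 0], [0, 1], [-1, -1]]
beta = [0.4, -0.3]

probs = multinomial_probs(design, beta)
logits = generalized_logits(probs)

# The same logits obtained linearly: (x_i - x_last)' beta.
linear_form = [sum((xi - xl) * b for xi, xl, b in zip(row, design[-1], beta))
               for row in design[:-1]]
print(logits, linear_form)
```

The normalizing constant cancels in each ratio, which is why the logits are exactly linear in β.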

In  particular,  the  generalization  of  log-linear 
models  to  multiple  populations  is  totally 
straightforward  with  CATMOD.  Multiple 
populations  are  formed  on  the  basis  of 
independent  (or  design)  variables,  and  a 
separate  multinomial  distribution  is  assumed  for 
each  population.  The  model  equations  for  such 
multiple  population  log-linear  models  can  be 
found  in  Imrey(l985)  and  Imrey,  Koch,  and 
Stokes ( 1981 ) . 

Imrey (1985) illustrates the use of the CATMOD
procedure for numerous logit and log-linear
model applications, including multiple logistic
models, quasi-independence, proportional odds
models, and a repeated measures (split-plot)
analysis of marginal logits. Imrey also discusses
some of the technicalities of CATMOD, including
the role of the REPEATED statement in log-linear
model analysis, the treatment of structural vs.
random zeros, and alternative formulations of
logistic models in terms of log-linear models.


Example 


The example is a simple one-population study in
which each subject was given three different
drugs, and the response (F=Favorable,
U=Unfavorable) to each was recorded (Koch et
al. 1977). The following control statements set up
the data set and specify a maximum-likelihood
analysis of a log-linear model:

DATA DRUGS;
   INPUT DRUGA $ DRUGB $ DRUGC $ COUNT @@;
   CARDS;
F F F 6   F F U 16   F U F 2   F U U 4
U F F 2   U F U 4    U U F 6   U U U 6
PROC CATMOD;
   WEIGHT COUNT;
   RESPONSE / OUT=PRED;
   MODEL DRUGA*DRUGB*DRUGC = _RESPONSE_
         / ML COVB PRED=FREQ;
   REPEATED
         / _RESPONSE_ = DRUGA DRUGB DRUGC DRUGA*DRUGB;
   TITLE 'ONE-POPULATION DRUG STUDY';
   TITLE2 'MLE ANALYSIS OF THE JOINT FREQUENCIES';


The RESPONSE statement specifies the analysis
of generalized logits and the creation of an
output data set containing predicted values. The
responses to the three drugs are designated as
dependent variables by their appearance on the
left-hand side of the MODEL statement.

The _RESPONSE_ keyword in the MODEL
statement indicates that the model is to include
sources of variation based on the levels of the
dependent variables. The REPEATED statement is
used only to define the _RESPONSE_ effect in
terms of the usual log-linear model main effects
and interactions. (When there is no repeated
measurement involved in the study, the
term REPEATED is a misnomer, but the definition
of _RESPONSE_ is nonetheless placed on the
REPEATED statement.) The specified model
contains main effects for each of the three
drugs, together with the DRUGA*DRUGB
interaction. The MODEL statement also requests
maximum-likelihood analysis, predicted cell
frequencies, and the estimated covariance matrix
of the parameter estimates.

Figure 13 shows the results of the maximum-
likelihood analysis, with the final parameter
estimates appearing in the row corresponding to
the last iteration. The analysis-of-variance table
gives the likelihood-ratio goodness-of-fit test,
together with Wald statistics for testing the
individual effects in the model. Figure 14
contains the estimated covariance matrix of the
parameter estimates, along with the table of
predicted cell frequencies and their standard
errors.

Figure 14

ONE-POPULATION DRUG STUDY
MLE ANALYSIS OF THE JOINT FREQUENCIES

COVARIANCE OF ESTIMATES

PREDICTED VALUES FOR RESPONSE FUNCTIONS AND FREQUENCIES

(The numeric entries of both tables are not legible.)

Iterative proportional fitting of the model is also
available, and may be desirable for those
situations in which the contingency table is very
large and the hierarchical model contains a great
many parameters. For this example, the required
control statements for IPF estimation of the
parameters are

PROC IML;
   TITLE2 'IPF ESTIMATION OF THE FREQUENCIES';
   DIM = {2 2 2};
   TABLE = { 6 16 2 4,
             2  4 6 6};
   CONFIG = { 1 2,
              0 3};
   CALL IPF(FIT,STATUS,DIM,TABLE,CONFIG);
   PRINT "OBSERVED FREQUENCIES "; PRINT TABLE;
   PRINT "ESTIMATED FREQUENCIES";
   PRINT FIT[FORMAT=7.5];
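
For readers who want to see what iterative proportional fitting does with this table, here is a minimal Python sketch. It is not the IML routine: the function name `ipf_ab_c` and the array layout `table[a][b][c]` (index 0 = Favorable, 1 = Unfavorable) are assumptions made for the illustration. It fits the model with main effects plus the DRUGA*DRUGB interaction by alternately matching the DRUGA x DRUGB margin and the DRUGC margin.

```python
def ipf_ab_c(table, tol=1e-10, max_iter=100):
    """Fit the [AB][C] log-linear model to a 2x2x2 table by IPF."""
    idx = range(2)
    fit = [[[1.0 for _ in idx] for _ in idx] for _ in idx]
    for _ in range(max_iter):
        old = [[[fit[a][b][c] for c in idx] for b in idx] for a in idx]
        # Step 1: scale so the fitted A x B margin equals the observed one.
        for a in idx:
            for b in idx:
                observed = sum(table[a][b])
                current = sum(fit[a][b])
                for c in idx:
                    fit[a][b][c] *= observed / current
        # Step 2: scale so the fitted C margin equals the observed one.
        for c in idx:
            observed = sum(table[a][b][c] for a in idx for b in idx)
            current = sum(fit[a][b][c] for a in idx for b in idx)
            for a in idx:
                for b in idx:
                    fit[a][b][c] *= observed / current
        change = max(abs(fit[a][b][c] - old[a][b][c])
                     for a in idx for b in idx for c in idx)
        if change < tol:
            break
    return fit

# Observed counts from the drug example: FFF FFU / FUF FUU / UFF UFU / UUF UUU.
table = [[[6, 16], [2, 4]],
         [[2, 4], [6, 6]]]
fit = ipf_ab_c(table)
for a in range(2):
    for b in range(2):
        print([round(v, 5) for v in fit[a][b]])
```

Because the fitted model is decomposable, the cycle converges almost immediately, and each fitted count equals (A x B margin) x (C margin) / (total count).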


Figure 15

Since the IPF analysis yields only the estimated
cell  frequencies,  one  might  be  interested  in 
running  a  general  linear  model  analysis  of  the 
predicted  cell  frequencies  in  order  to  obtain 
other  useful  information  such  as  (1)  Wald 
statistics  for  the  individual  effects  in  the  model, 
and  (2)  the  maximum-likelihood  estimate  of  the 
covariance  matrix  of  the  estimated  parameters. 
The  required  control  statements  are  the  same  as 
those  used  previously,  except  that  the  observed 
frequencies  are  replaced  by  the  predicted 
frequencies  in  the  WEIGHT  statement: 

*------------------------------------------------*;
DATA PREDICT; SET PRED; IF _TYPE_='FREQ';
DATA DRUG2; MERGE DRUGS PREDICT;
*------------------------------------------------*;
PROC CATMOD;
   WEIGHT _PRED_;
   MODEL DRUGA*DRUGB*DRUGC = _RESPONSE_
         / COV ML COVB PRED=FREQ;
   REPEATED
         / _RESPONSE_ = DRUGA DRUGB DRUGC DRUGA*DRUGB;
   TITLE2 'ANALYSIS OF IPF-ESTIMATED FREQUENCIES';

The  results,  shown  in  Figures  16  and  17,  are 
essentially  the  same  as  the  previous  results, 
except  that  only  one  Newton-Raphson  iteration  is 
required  for  convergence,  and  the  goodness-of- 
fit  statistic  is  zero,  as  are  the  residuals. 


Figure 15 shows that the cell frequencies
predicted from the IPF algorithm are identical to
those obtained from the Newton-Raphson
algorithm in Figure 14.

Figure 17

6. MODELS FOR ORDINAL DATA

A recent book by Agresti (1984) focuses on
analysis methods that can be used whenever
there are ordinally scaled variables to be
analyzed. Two of the primary methods of analysis
recommended by Agresti can be done with the
CATMOD procedure:

•  log-linear  models,  using  generalized  logits 

•  logit  models,  using  cumulative  logits. 

Cumulative  logits  (logits  of  cumulative 
probabilities)  are  monotonically  increasing  (or 
decreasing),  so  that  the  ordinal  nature  of  the 
dependent  variable  is  automatically  incorporated 
into  those  functions.  Generalized  logits,  on  the 
other  hand,  do  not  inherently  reflect  the  ordinal 
nature  of  the  dependent  variable,  and  therefore 
the  ordinality  must  be  built  into  the  design 
matrix  in  a  general  linear  model. 
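
The difference between the two kinds of response functions is easy to see on a small example. The Python sketch below is an illustration only: the probability vector is hypothetical, the generalized logits are taken relative to the last category, and the cumulative logits are formed from the lower cumulative probabilities. Other conventions (for example, upper cumulative probabilities) only change the signs.

```python
import math

def generalized_logits(p):
    """log(p_j / p_last) for every category except the last."""
    return [math.log(pj / p[-1]) for pj in p[:-1]]

def cumulative_logits(p):
    """logit of P(Y <= j) for every cut point j below the last category."""
    out = []
    cum = 0.0
    for pj in p[:-1]:
        cum += pj
        out.append(math.log(cum / (1.0 - cum)))
    return out

p = [0.6, 0.3, 0.1]  # hypothetical probabilities for an ordered response
g = generalized_logits(p)
c = cumulative_logits(p)
print(g)
print(c)
```

The cumulative logits are ordered by construction because the cumulative probabilities are. The generalized logits carry no such ordering, which is why the ordinal information has to be supplied through the design matrix.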

Regardless of whether the model is logit or log-
linear, structural models can be built that reflect
certain hypotheses and take into account the
scaling of other variables in the analysis. The
following discussion assumes a two-way table,
with the dependent (column) variable always
presumed to be ordinally scaled. Three of the
most important structural models are as follows.

•  INDEPENDENCE MODEL --- For the log-
linear model, this structural model implies
an odds ratio of 1 for every choice of two
rows and two columns. For the logit model,
it implies an odds ratio of 1 for every
possible dichotomy of the column variable
and every pair of rows.

•  ROW-EFFECTS MODEL --- This model is
used when the row variable lies on a nominal
scale. Compared to the independence model,
it contains one additional parameter for each
row. For a log-linear model with integer
column scores, it implies that the odds ratio
for 2 adjacent columns and for any 2 rows is
a function of the difference between the row
parameters. For a logit model, it implies
that the odds ratio for any 2 rows is a
function of the difference between the row
parameters, regardless of which collapsing
is used to form a dichotomy of the column
variable.

•  UNIFORM-ASSOCIATION MODEL --- This
model is used when the row variable lies on
an ordinal scale. Compared to the
independence model, it contains one
additional parameter, $\beta$, that measures the
association between the two variables. For a
log-linear model with integer row and
column scores, it implies that the odds ratio
for any 2 adjacent columns and any 2
adjacent rows is $\exp(\beta)$. Such a model is
also called an equal adjacent odds ratio
model. For a logit model with integer row
scores, it implies that the odds ratio for any
2 adjacent rows is $\exp(\beta)$, regardless of
which collapsing is used to form a dichotomy
of the column variable. Such a model is also
called a proportional odds model.

All  of  these  ordinal  models  can  be  generalized  to 
the  case  of  multiple  variables. 
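
The equal adjacent odds ratio property of the log-linear uniform-association model can be verified directly. In the Python sketch below, the row and column effects and the value of beta are hypothetical, and the expected counts are built as exp(row effect + column effect + beta x row score x column score) with integer scores 1, 2, 3, ... This is one standard way to write the model and is an assumption of the example.

```python
import math

def uniform_association_table(row_eff, col_eff, beta):
    """Expected counts exp(a_i + b_j + beta*i*j) with integer scores."""
    return [[math.exp(a + b + beta * (i + 1) * (j + 1))
             for j, b in enumerate(col_eff)]
            for i, a in enumerate(row_eff)]

def local_odds_ratios(m):
    """Odds ratios for every pair of adjacent rows and adjacent columns."""
    return [[m[i][j] * m[i + 1][j + 1] / (m[i][j + 1] * m[i + 1][j])
             for j in range(len(m[0]) - 1)]
            for i in range(len(m) - 1)]

beta = -0.162                      # hypothetical association parameter
rows = [0.0, 0.3, -0.2, 0.1]       # hypothetical row effects (4 rows)
cols = [1.0, 0.5, 0.0]             # hypothetical column effects (3 columns)

m = uniform_association_table(rows, cols, beta)
ratios = local_odds_ratios(m)
for r in ratios:
    print([round(v, 4) for v in r])
```

Every entry of `ratios` equals exp(beta) whatever row and column effects are chosen, which is what "uniform association" means.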


Example 


The methods are illustrated with the dumping
syndrome data introduced in Section 3. The
dependent variable, severity, is ordinally scaled
(with values NONE, SLIGHT, and MODERATE),
and the independent variable, treatment, is also
ordinally scaled since the treatments correspond
to the percentage of the stomach removed during
a surgical operation (0, 25, 50, 75). Thus, a
uniform-association model is most appropriate for
these data, and that is the type of structural
model fitted here. The variable HOSPITAL is
ignored in order to illustrate the two-variable
models.

Figure 18 shows the control statements required
to fit the log-linear uniform-association model.
The third column of the design matrix reflects,
in a multiplicative way, the ordinal scales of the
variables treatment and severity ( (2 1)
kronecker (1 2 3 4) ). Figure 19 displays the
results of the maximum-likelihood analysis,
showing that the model fits well. The operation
effect is now significant (p=.01) because
the ordinal nature of treatment has been
exploited and there is some linear association
between treatment and severity. The maximum-
likelihood estimate of the uniform-association
parameter $\beta$ (-.162) converts to a uniform odds
ratio estimate of $\exp(-.162) = 0.85$.
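
The Kronecker construction of the third design column, and the conversion of the estimate to an odds ratio, can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the ordering (two response functions nested within each of the four treatment populations) is inferred from the first and last legible rows of the design matrix in Figure 18, and the helper `kron` is written for the example.

```python
import math

def kron(u, v):
    """Kronecker product of two vectors, with v varying fastest."""
    return [a * b for a in u for b in v]

treatment_scores = [1, 2, 3, 4]   # four treatment populations
severity_scores = [2, 1]          # one score per generalized logit

third_column = kron(treatment_scores, severity_scores)

# Assemble rows: the two intercept columns alternate within each population.
design = []
for k, score in enumerate(third_column):
    intercepts = [1, 0] if k % 2 == 0 else [0, 1]
    design.append(intercepts + [score])

for row in design:
    print(row)

beta_hat = -0.162                 # estimate quoted in the text
print(round(math.exp(beta_hat), 2))
```

The first and last rows produced this way are (1 0 2) and (0 1 4), which agree with the legible rows of the MODEL statement in Figure 18.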


Figure 18

PROC CATMOD ORDER=DATA;
   TITLE2 'LOGLINEAR UNIFORM ASSOCIATION';
   WEIGHT WT;
   POPULATION TRT;
   MODEL SEVERITY = (1 0 2,
                     (intermediate rows not legible)
                     0 1 4)
                    (1 = 'INTERCEPT1',
                     2 = 'INTERCEPT2',
                     3 = 'OPERATION')
         / FREQ ONEWAY ML PREDICT=FREQ;

Figure 19

ANALYSIS OF DUMPING SYNDROME DATA
LOGLINEAR UNIFORM ASSOCIATION

ANALYSIS OF VARIANCE TABLE

SOURCE              DF   CHI-SQUARE     PROB
INTERCEPT1           1      39.74     0.0001
INTERCEPT2           1      31.47     0.0001
OPERATION            1       6.15     0.0132
LIKELIHOOD RATIO     5       4.59     0.4680

ANALYSIS OF INDIVIDUAL PARAMETERS

                                 STANDARD    CHI-
EFFECT   PARAMETER   ESTIMATE     ERROR     SQUARE     PROB
MODEL        1        2.4672    0.391377    39.74    0.0001
             2        1.43336   0.255521    31.47    0.0001
             3       -.162621   .0655658     6.15    0.0132

Figure 20 shows the control statements required
to fit the logit uniform-association model. Figure
21 displays the results of the weighted-least-
squares analysis, showing that the estimate and
the test of $\beta$ are very similar to those obtained
from the log-linear model. Although the first two
columns of the design matrix are parameterized
differently than those in the log-linear model,
they span the same space.

Figure 20

PROC CATMOD ORDER=DATA;
   TITLE2 'LOGIT UNIFORM ASSOCIATION';
   WEIGHT WT;
   DIRECT TRTMNT;
   MODEL SEVERITY = _RESPONSE_

Figure 21

ANALYSIS OF DUMPING SYNDROME DATA
LOGIT UNIFORM ASSOCIATION

CATMOD PROCEDURE

          FUNCTION    RESPONSE
SAMPLE     NUMBER     FUNCTION
   1          1       -0.555526
              2       -2.54273
   2          1       -0.635989
              2       -1.94591
   3          1       -0.109199
              2       -2.10006
   4          1        0.0186921
              2       -1.73827

(The DESIGN MATRIX columns 1 2 3 are not legible.)

ANALYSIS OF VARIANCE TABLE

SOURCE              DF   CHI-SQUARE     PROB
INTERCEPT            1      45.73     0.0001
_RESPONSE_           1     142.29     0.0001
TRTMNT               1       6.37     0.0116
RESIDUAL             5       4.57     0.4712

ANALYSIS OF INDIVIDUAL PARAMETERS

                                    STANDARD    CHI-
EFFECT       PARAMETER   ESTIMATE    ERROR     SQUARE     PROB
INTERCEPT        1        1.73389   0.25639    45.73    0.0001
_RESPONSE_       2       -.855483  .0717182   142.29    0.0001
TRTMNT           3       -0.22157  .0877552     6.37    0.0116

Figure 22 shows a summary of the tests of the
uniform-association parameter, all obtained
from CATMOD. Regardless of whether one uses a
logit or a log-linear model, maximum-likelihood or
weighted-least-squares estimation, Wald or
likelihood-ratio tests, the results are essentially
the same. A similar conclusion can be drawn from
Figure 23, which displays the results of the
estimation of $\beta$ and the uniform odds ratio,
$\exp(\beta)$.

Figure 22

Analysis of Dumping Syndrome Data
Results of Testing the Uniform Association Parameter $\beta$

Type of      Type of       Type of Test    Test
Analysis     Estimation    Statistic       Statistic    Prob
Loglinear    WLS           Wald              5.96       0.01
Loglinear    MLE           Wald              6.15       0.01
Loglinear    MLE           LRT*              6.29       0.01
Logit        WLS           Wald              6.37       0.01

*LRT = G²(Independence) - G²(Uniform Association Model)
     = 10.88 - 4.59 = 6.29

Figure 23

Analysis of Dumping Syndrome Data
Results of Estimating the Uniform Association Parameter

Type of      Type of       Estimate     Standard
Analysis     Estimation    of $\beta$   Error       $\exp(\beta)$
Loglinear    WLS           -0.160       0.066       0.85
Loglinear    MLE           -0.163       0.066       0.85
Logit        WLS           -0.222       0.088       0.80

Another powerful method of dealing with an
ordinal dependent variable is to analyze the mean
score of that variable for each population, rather
than analyzing a set of logits (Grizzle, Starmer,
and Koch 1969). If an independent variable is
nominally scaled in such an analysis, then it is
treated as a main effect, and the analysis is
sensitive to differences among the levels of that
variable with respect to the mean scores. If it is
ordinally scaled, then it is treated in a
quantitative way by a single column in the design
matrix, and the extent of its linear association
with the dependent variable is measured by the
corresponding parameter.

In the following analysis of the dumping
syndrome data, the scores for SEVERITY are 0
for none, 0.5 for slight, and 1 for moderate. The
variables hospital and treatment are both
regarded as nominally scaled:

PROC CATMOD ORDER=DATA;
   WEIGHT WT;
   RESPONSE 0 0.5 1;
   MODEL SEVERITY = TRT HOSPITAL;
   TITLE2 'MAIN-EFFECTS MODEL';

The results, shown in Figure 24, indicate a
significant treatment effect. However, if
treatment is regarded as ordinally scaled (by its
appearance in a DIRECT statement):

PROC CATMOD ORDER=DATA;
   WEIGHT WT;
   DIRECT TRTMNT;
   RESPONSE 0 0.5 1;
   MODEL SEVERITY = TRTMNT HOSPITAL;
   TITLE2 'LINEAR OPERATION EFFECT';

then the results (Figure 25) show even stronger
evidence of association.
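
To make the mean-score response function concrete, the Python sketch below computes the mean score and its estimated variance for one population, using the scores 0, 0.5, and 1 from the text. The counts are hypothetical (the dumping syndrome frequencies are not reproduced in this section), and the variance formula is the usual multinomial one, a'(D_p - pp')a / n, which is an assumption about the estimator rather than a quotation of CATMOD output.

```python
def mean_score(counts, scores):
    """Mean score and its multinomial variance estimate for one population."""
    n = sum(counts)
    p = [c / n for c in counts]
    mean = sum(s * q for s, q in zip(scores, p))
    second_moment = sum(s * s * q for s, q in zip(scores, p))
    variance = (second_moment - mean ** 2) / n
    return mean, variance

scores = [0.0, 0.5, 1.0]   # none, slight, moderate
counts = [60, 30, 10]      # hypothetical frequencies for one population

mean, variance = mean_score(counts, scores)
print(mean, variance)
```

One such mean (with its variance) is produced for each population, and the resulting vector of means is then modeled by weighted least squares.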

Figure 24

ANALYSIS OF DUMPING SYNDROME DATA
MAIN-EFFECTS MODEL

ANALYSIS OF VARIANCE TABLE

SOURCE        DF   CHI-SQUARE     PROB
INTERCEPT      1     248.77     0.0001
TRT            3       8.90     0.0307
HOSPITAL       3       2.33     0.5065
RESIDUAL       9       6.33     0.7069

Figure 25

ANALYSIS OF DUMPING SYNDROME DATA
LINEAR OPERATION EFFECT

ANALYSIS OF VARIANCE TABLE

SOURCE        DF   CHI-SQUARE     PROB
INTERCEPT      1      18.28     0.0001
TRTMNT         1       8.60     0.0034
HOSPITAL       3       2.31     0.5098
RESIDUAL      11       6.(      0.8284

7.  REPEATED  MEASURES  ANALYSIS 


The CATMOD procedure has a number of features
that facilitate repeated measures analysis. They
include

•  a REPEATED statement that allows one to
identify and name repeated measurement
factors

•  a very general modeling specification that
allows repeated measures to be modeled in
any fashion

•  shorthand specification of commonly used
response functions in repeated measures
analysis, such as marginal probabilities,
marginal logits, and means.

Repeated measures methodology and the
corresponding CATMOD capabilities are reviewed
in Stanish and Koch (1984). Numerous examples
are given there and in the SAS User's Guide:
Statistics (1985). The following example is a
simple illustration of an analysis of marginal
probabilities.

Example 


These data are from a study of the effect of
advertising on sales (Bishop, Fienberg, and
Holland 1975, p. 274). At each of two time
points, subjects were asked if they had seen an
advertisement for a specific product and if they
had bought that product. The question of
interest is: what is the effect on sales of the time
between the two interviews, seeing the first
advertisement, and seeing the second
advertisement?

The first model is simply a saturated model to
assess the significance of the main effects and
interactions of the independent variables and the
repeated measurement factor. The required
control statements to read the data and fit the
saturated model are as follows:

DATA A;
   INPUT SEE1 $ SEE2 $ BUY1 $ BUY2 $ COUNT;
   CARDS;
NO   NO   YES  YES   95
NO   NO   YES  NO    15
NO   NO   NO   YES    6
NO   NO   NO   NO   493
YES  YES  YES  YES   83
YES  YES  YES  NO     8
YES  YES  NO   YES   22
YES  YES  NO   NO    68
YES  NO   YES  YES   35
YES  NO   YES  NO     7
YES  NO   NO   YES   11
YES  NO   NO   NO    28
NO   YES  YES  YES   25
NO   YES  YES  NO    10
NO   YES  NO   YES    8
NO   YES  NO   NO    32
PROC CATMOD ORDER=DATA;
   WEIGHT COUNT;
   RESPONSE MARGINALS;
   MODEL BUY1*BUY2 = SEE1|SEE2|_RESPONSE_;
   REPEATED TIME 2;
   TITLE 'ADVERTISING DATA---SATURATED MODEL';

The results, shown in Figure 26, indicate that
some of the interactions are nonsignificant
(p>.10). That fact, together with
an examination of the marginal probabilities of
buying at the two time points, leads one to a
reduced model that contains two primary effects.
One is an effect due to seeing at least one ad,
which may reflect, in part, exposure to the
medium (or media) in which the ads appear. The
other is an incremental effect of the first ad on
the probability of buying the second product,
which may reflect, in part, exposure to the
company selling the products. The control
statements required to fit this reduced model are

PROC CATMOD ORDER=DATA;
   WEIGHT COUNT;
   POPULATION SEE1 SEE2;
   RESPONSE MARGINALS;
   MODEL BUY1*BUY2 = (1 0 0,
                      1 0 0,
                      1 1 0,
                      1 1 0,
                      1 1 0,
                      1 1 1,
                      1 1 0,
                      1 1 1)
                     (1 = 'P(BUY1 | NO ADS SEEN)',
                      2 = 'SEEING AT LEAST ONE AD',
                      3 = 'EFFECT OF AD 1 ON BUY2')
         / FREQ PRED;
   TITLE 'ADVERTISING DATA---REDUCED MODEL';

The results of fitting the reduced model are
shown in Figures 27 and 28. The four populations
are based on whether or not the subjects saw the
two advertisements. The printed response
functions are the marginal probabilities of buying
the two products. The analysis-of-variance table
indicates that the model fits (p=.40) and that all
of the effects are statistically significant
(p<.05). The parameter estimates and the
predicted marginal probabilities are given in
Figure 28.
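
The marginal probabilities that serve as response functions can be recomputed from the response frequencies by hand or with a short script. The Python sketch below uses the frequencies of the advertising data. The dictionary layout and the helper name `marginals` are choices made for the illustration, and the response order (YES/YES, YES/NO, NO/YES, NO/NO for BUY1/BUY2) follows the order of the data lines.

```python
# Response frequencies by population (SEE1, SEE2); within each population the
# (BUY1, BUY2) responses are ordered (YES,YES), (YES,NO), (NO,YES), (NO,NO).
freq = {
    ("NO", "NO"):   [95, 15, 6, 493],
    ("NO", "YES"):  [25, 10, 8, 32],
    ("YES", "NO"):  [35, 7, 11, 28],
    ("YES", "YES"): [83, 8, 22, 68],
}

def marginals(counts):
    """Return (P(BUY1 = YES), P(BUY2 = YES)) for one population."""
    n = sum(counts)
    p_buy1 = (counts[0] + counts[1]) / n
    p_buy2 = (counts[0] + counts[2]) / n
    return p_buy1, p_buy2

for population, counts in freq.items():
    p1, p2 = marginals(counts)
    print(population, sum(counts), round(p1, 6), round(p2, 6))
```

The sample sizes and the two marginal probabilities printed for each population can be compared with the POPULATION PROFILES and RESPONSE FUNCTION listings in Figure 27.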

Figure 26

ADVERTISING DATA---SATURATED MODEL

ANALYSIS OF VARIANCE TABLE

SOURCE                    DF   CHI-SQUARE     PROB
INTERCEPT                  1     468.85     0.0001
SEE1                       1      33.60     0.0001
SEE2                       1      12.40     0.0004
SEE1*SEE2                  1      12.72     0.0004
TIME                       1       1.06     0.3025
SEE1*_RESPONSE_            1       4.13     0.0420
SEE2*_RESPONSE_            1       0.04     0.8459
SEE1*SEE2*_RESPONSE_       1       0.23     0.6300
RESIDUAL                   0       0.00     1.0000

NOTE: _RESPONSE_ = TIME


Figure 27

ADVERTISING DATA---REDUCED MODEL

POPULATION PROFILES

                          SAMPLE
SAMPLE   SEE1    SEE2      SIZE
   1     NO      NO         609
   2     NO      YES         75
   3     YES     NO          81
   4     YES     YES        181

RESPONSE FREQUENCIES

              RESPONSE NUMBER
SAMPLE      1      2      3      4
   1       95     15      6    493
   2       25     10      8     32
   3       35      7     11     28
   4       83      8     22     68

          FUNCTION    RESPONSE     DESIGN MATRIX
SAMPLE     NUMBER     FUNCTION      1    2    3
   1          1       0.180624      1    0    0
              2       0.165846      1    0    0
   2          1       0.466667      1    1    0
              2       0.44          1    1    0
   3          1       0.518519      1    1    0
              2       0.567901      1    1    1
   4          1       0.502762      1    1    0
              2       0.58011       1    1    1

ANALYSIS OF VARIANCE TABLE

SOURCE                      DF   CHI-SQUARE       PROB
P(BUY1 | NO ADS SEEN)        1   (not legible)   0.0001
SEEING AT LEAST ONE AD       1     113.38        0.0001
EFFECT OF AD 1 ON BUY2       1       9.10        0.0026
RESIDUAL                     5       5.15        0.3973

Figure 28

ADVERTISING DATA---REDUCED MODEL

ANALYSIS OF INDIVIDUAL PARAMETERS

(Columns: EFFECT, PARAMETER, ESTIMATE, STANDARD ERROR,
CHI-SQUARE, PROB. The table rows are not legible.)

8. COMPLEX SAMPLE SURVEY DATA ANALYSIS

Currently, the CATMOD procedure is not suited
for the analysis of complex sample survey data
because CATMOD computes covariance estimates
under the assumption that the frequencies were
obtained by stratified simple random sampling.
However, the IML procedure can be used for
such an analysis because it contains a very
powerful programming language. This makes it
straightforward to program a general linear
modeling algorithm with any desired capabilities.
Since there are already SAS procedures available
to compute weighted probability and covariance
estimates for complex sample survey applications
(PROC SURREGR and PROC SESUDAAN), the
IML program could be used to read a function
vector and its estimated covariance matrix, and
then do general linear modeling of the function
vector. Such an IML program has been written,
and it is listed in the appendix.
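
The computation such a program performs can be outlined in a few lines. The Python sketch below is not the IML program from the appendix. It is a minimal stand-in, assuming the usual weighted-least-squares formulas: b = (X'V⁻¹X)⁻¹X'V⁻¹F, a residual chi-square Q = (F - Xb)'V⁻¹(F - Xb), and a Wald statistic (Cb)'[C Cov(b) C']⁻¹(Cb) for a hypothesis Cb = 0. All function names and the toy input are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def wls_fit(f, v, x):
    """Return (estimates, covariance of estimates, residual chi-square)."""
    v_inv = inverse(v)
    xt_vinv = matmul(transpose(x), v_inv)
    cov_b = inverse(matmul(xt_vinv, x))
    f_col = [[value] for value in f]
    b = matmul(cov_b, matmul(xt_vinv, f_col))
    pred = matmul(x, b)
    resid = [[fi[0] - pi[0]] for fi, pi in zip(f_col, pred)]
    q = matmul(matmul(transpose(resid), v_inv), resid)[0][0]
    return [row[0] for row in b], cov_b, q

def wald(c, b, cov_b):
    """Wald chi-square for the hypothesis C b = 0."""
    cb = matmul(c, [[value] for value in b])
    middle = inverse(matmul(matmul(c, cov_b), transpose(c)))
    return matmul(matmul(transpose(cb), middle), cb)[0][0]

# Toy input (hypothetical): three functions, independent with variance 0.01.
F = [1.0, 2.0, 3.0]
V = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
X = [[1, 0], [1, 1], [1, 2]]

b, cov_b, q = wls_fit(F, V, X)
slope_test = wald([[0, 1]], b, cov_b)
print(b, q, slope_test)
```

In a survey application, `F` and `V` would be the weighted proportions and their design-based covariance matrix, so the sampling design enters only through `V`.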

Example 


These data are from the blood lead subsample of the
Second National Health and Nutrition Examination
Survey (denoted NHANES II; reference:
McDowell et al. 1981). Only the data for persons
under age 18 in one stratum (out of 32) are
considered here (Landis and Lepkowski 1984).
The levels of the dependent and independent
variables are given in Figure 29. The question of
interest is: to what extent are the variables age,
race, and income related to the presence of
elevated levels of lead in the blood?

The weighted probability and covariance
estimates were computed with PROC SURREGR
(Landis and Lepkowski 1984). The IML program
WLS was then run with three input data sets in
order to fit a saturated model via weighted least
squares.

•  One data set, called INPUT, contains the
proportion vector and its estimated
covariance matrix. For this example, the
estimates were typed in directly (Figure
30), but ordinarily, the estimates would be
contained in an output data set created by
PROC SURREGR.

•  Another data set, called DESIGN, contains
the design matrix (Figure 31).

•  A third data set, called TEST, contains C
matrices for testing the hypotheses $C\beta = 0$,
together with labels for the hypotheses
(Figure 32).

Figure 33 shows that the analysis is invoked
by calling the IML program WLS. Figure
33 also displays the control statements required
to fit a reduced model.

Figures 34 and 35 give the results of the
saturated model analysis. Included in the output
are the estimated parameters and their standard
errors, the predicted functions and their
standard errors, the goodness-of-fit test, and
the analysis-of-variance table that contains tests
for all the $C\beta = 0$ hypotheses specified in the TEST
data set.

The results of the reduced model analysis, shown
in Figure 36, indicate that the fit of the model is
barely adequate (p=.09). The age, race, and
income effects are all statistically significant
($\alpha$=.05), with the race and income effects being
significantly more important for the younger age
group.

Figure 29

NHANES II BLOOD LEAD SAMPLE

Dependent Variable
   Blood Lead Level      <20 µg/dl      20+

Independent Variables
   Age                   <=6            6-17
   Race                  Black          White
   Income                <$10,000       $10,000+


Figure 30

*---CREATE DATASET FOR THE PROPORTIONS AND THE COVARIANCE MATRIX---;
DATA INPUT;

(The INPUT statement and the data lines, which give the eight
weighted proportions and their estimated covariance matrix, are
not legible.)


Figure 31

*---CREATE DATASET FOR THE DESIGN MATRIX---;
DATA DESIGN;
   INPUT X1-X8;
   CARDS;
1  1  1  1  1  1  1  1
1  1  1 -1  1 -1 -1 -1
1  1 -1  1 -1  1 -1 -1
1  1 -1 -1 -1 -1  1  1
1 -1  1  1 -1 -1  1 -1
1 -1  1 -1 -1  1 -1  1
1 -1 -1  1  1 -1 -1  1
1 -1 -1 -1  1  1  1 -1

Figure 32

*---CREATE DATASET FOR THE HYPOTHESIS TESTS---;
DATA TEST;
   INPUT TITLE $ 1-24 N C1-C8;
   CARDS;
AGE                      1   0 1 0 0 0 0 0 0
RACE                     1   0 0 1 0 0 0 0 0
INCOME                   1   0 0 0 1 0 0 0 0
AGE * RACE               1   0 0 0 0 1 0 0 0
AGE * INCOME             1   0 0 0 0 0 1 0 0
RACE * INCOME            1   0 0 0 0 0 0 1 0
AGE * RACE * INCOME      1   0 0 0 0 0 0 0 1
ALL INTERACTIONS ZERO    4   0 0 0 0 1 0 0 0
                             0 0 0 0 0 1 0 0
                             0 0 0 0 0 0 1 0
                             0 0 0 0 0 0 0 1


Figure 33

*---CALL THE MACRO TO DO THE ANALYSIS---;
TITLE 'ANALYSIS OF COMPLEX SAMPLE SURVEY DATA';
TITLE2 'SATURATED MODEL';
WLS

*---FIT A NEW MODEL---CHANGE DESIGN MATRIX---;
TITLE2 'NESTED MAIN EFFECTS MODEL';
DATA DESIGN;
   INPUT X1-X6;
   CARDS;
1  1  1  0  1  0
1  1  1  0 -1  0
1  1 -1  0  1  0
1  1 -1  0 -1  0
(remaining rows not legible)

*---CREATE DATASET FOR THE NEW HYPOTHESIS TESTS---;
DATA TEST;
   INPUT TITLE $ 1-28 N C1-C6;
   CARDS;
MODEL|INTERCEPT              5   0 1 0 0 0 0
                                 0 0 1 0 0 0
                                 0 0 0 1 0 0
                                 0 0 0 0 1 0
                                 0 0 0 0 0 1
AGE                          1   0 1 0 0 0 0
RACE(AGE)                    2   0 0 1 0 0 0
                                 0 0 0 1 0 0
RACE(AGE GROUP 1)            1   0 0 1 0 0 0
RACE(AGE GROUP 2)            1   0 0 0 1 0 0
RACE(AGE1) = RACE(AGE2)      1   0 0 1 -1 0 0
INCOME(AGE)                  2   0 0 0 0 1 0
                                 0 0 0 0 0 1
INCOME(AGE GROUP 1)          1   0 0 0 0 1 0
INCOME(AGE GROUP 2)          1   0 0 0 0 0 1
INCOME(AGE1) = INCOME(AGE2)  1   0 0 0 0 1 -1

*---CALL THE MACRO TO DO THE ANALYSIS---;
WLS


Figure 34

Analysis of Complex Sample Survey Data
Saturated Model

Estimated      Std         Predicted     Std
Parameters     Errors      Functions     Errors
0.2335         0.0184      0.5905        0.0411
0.1128         0.0127      0.3847        0.0427
0.0930         0.0147      0.2733        0.0325
0.0539         0.0108      0.1365        0.0199
0.0484         .0094979    0.1760        0.0371
0.0318         .0069225    0.1546        0.0443
.0029277       .0095636    0.1096        0.0206
0.0143         .0067561    0.0427        0.0108

Figure 35

ANALYSIS OF COMPLEX SAMPLE SURVEY DATA
SATURATED MODEL

ANALYSIS OF VARIANCE TABLE

SOURCE                    DF    CHI-SQR         PROB
AGE                        1    78.7373        0.0001
RACE                       1    40.2621        0.0001
INCOME                     1    24.9236        0.0001
AGE * RACE                 1    25.9616        0.0001
AGE * INCOME               1    21.0859        (not legible)
RACE * INCOME              (not legible)       0.7595
AGE * RACE * INCOME        1     4.4832        0.0342
ALL INTERACTIONS ZERO      4    51.0695        0.0001

Figure 36

ANALYSIS OF COMPLEX SAMPLE SURVEY DATA
NESTED MAIN EFFECTS MODEL

ESTIMATED PARAMETERS (B)

0.2220
0.1138
0.1325
0.0365
0.0786
0.0313

GOODNESS-OF-FIT TEST

DF   CHI-SQR
(values not legible)

(Fig. 36 continued)

ANALYSIS OF VARIANCE TABLE

SOURCE                          DF    CHI-SQR      PROB
MODEL|INTERCEPT                  5   269.8816    0.0001
AGE                              1    82.3638    0.0001
RACE(AGE)                        2    58.1268    0.0001
RACE(AGE GROUP 1)                1    57.8449    0.0001
RACE(AGE GROUP 2)                1     5.5449    0.0185
RACE(AGE1) = RACE(AGE2)          1    27.0114    0.0001
INCOME(AGE)                      2    63.0541    0.0001
INCOME(AGE GROUP 1)              1    61.3218    0.0001
INCOME(AGE GROUP 2)              1    12.2894    0.0005
INCOME(AGE1) = INCOME(AGE2)      1    17.2909    0.0001

9. ACKNOWLEDGEMENTS

The author is indebted to Dr. J. Richard Landis,
who organized the categorical data analysis
session, and who provided a copy of his tutorial
(with Lepkowski) on complex sample survey data
analysis. The Graphic Arts Department of SAS
Institute was instrumental in preparing the
camera-ready copy of this manuscript.

SAS is the registered trademark of SAS Institute
Inc., Cary, NC, USA.

PROC SURREGR and PROC SESUDAAN are

10. REFERENCES

1.  Agresti, Alan (1984), Analysis of Ordinal
Categorical Data, New York: John Wiley &
Sons, Inc.

2.  Birch,  M.  W.  (1965),  "The  Detection  of 
Partial  Association,  II:  the  General  Case," 
Journal  of  the  Royal  Statistical  Society,  B, 
27,  111-124. 

3.  Bishop,  Y.,  Fienberg,  S.E.,  and  Holland, 
P.W.  (1975),  Discrete  Multivariate 
Analysis:  Theory  and  Practice,  Cambridge: 
The  MIT  Press.

4.  Breslow,  N.  E.  and  Day,  N.  E.  (1980), 
Statistical  Methods  in  Cancer  Research, 
Volume 1: The Analysis of Case-Control
Studies, Lyon: International Agency for
Research  on  Cancer. 

5.  Cochran,  William  G.  (1954),  "Some  Methods 
for Strengthening the Common χ² Tests,"
Biometrics,  10,  417-451. 

6.  Cox,  D.R.  (1970),  The  Analysis  of  Binary 
Data,  New  York:  Halsted  Press. 

7.  Fienberg,  S.E.  (1980),  The  Analysis  of 
Cross-Classified Categorical Data, Second
Edition,  Cambridge,  Massachusetts:  The 
MIT  Press. 


8.  Goodman, L.A. and Kruskal, W.H. (1979),
Measures of Association for Cross
Classifications, New York: Springer-Verlag.
(Reprints of JASA articles above).

9.  Grizzle,  J.E.,  Starmer,  C.F.,  and  Koch, 
G.G.  (1969),  "Analysis  of  Categorical  Data 
by Linear Models," Biometrics 25, 489-504.

10.  Haldane,  J.B.S.  (1955),  "The  Estimation 
and Significance of the Logarithm of a
Ratio  of  Frequencies,"  Annals  of  Human 
Genetics,  20,  309-314. 

11.  Imrey,  P.B.  (1985),  "SAS  Software  for  Log- 
Linear  Models,"  Proceedings  of  the  Tenth 
Annual  SAS  Users  Group  International 
Conference,  Cary,  NC:  SAS  Institute  Inc. 

12.  Imrey, P.B., Koch, G.G., and Stokes,
M.E. (1981), "Categorical Data Analysis:
Some Reflections on the Log Linear Model
and Logistic Regression. Part I: Historical
and Methodological Overview,"
International Statistical Review 49, 265-283.

13.  Imrey, P.B., Koch, G.G., and Stokes,
M.E. (1982), "Categorical Data Analysis:
Some Reflections on the Log Linear Model
and Logistic Regression. Part II: Data
Analysis," International Statistical Review
50, 35-63.

14.  Kleinbaum, David G., Kupper, Lawrence
L., and Morgenstern, Hal (1982),
Epidemiologic Research: Principles and
Quantitative Methods, Belmont, California:
Wadsworth, Inc.

15.  Koch, G.G., Landis, J.R., Freeman, J.L.,
Freeman,  D.H.,  and  Lehnen,  R.G.  (1977), 
"A  General  Methodology  for  the  Analysis  of 
Experiments  with  Repeated  Measurement  of 
Categorical  Data,"  Biometrics  33,  133-158. 

16.  Landis,  J.R.,  Heyman,  E.R.,  and  Koch, 
G.G.  (1978),  "Average  Partial  Association 
in  Three-way  Contingency  Tables,  a  Review 
and  Discussion  of  Alternative  Tests," 
International Statistical Review, 46,
237-254.

17.  Landis,  J.R.  and  Koch,  G.G.  (1977),  "The 
Measurement  of  Observer  Agreement  for 
Categorical  Data,"  Biometrics  33,  159-174. 

18.  Landis,  J.R.  and  Lepkowski,  J.M.  (1984). 
Tutorial  on  The  Analysis  of  Categorical  Data 
from  Complex  Sample  Surveys.  Unpublished. 

19.  Landis,  J.R.,  Stanish,  W.M.,  Freeman, 
J.L.  and  Koch,  G.G.  (1976),  "A  Computer 
Program  for  the  Generalized  Chi-Square 
Analysis  of  Categorical  Data  Using  Weighted 
Least  Squares,  (GENCAT),"  Computer 
Programs in Biomedicine 6, 196-231.


20.  Mantel, N. and Haenszel, W. (1959),
"Statistical Aspects of the Analysis of Data
from Retrospective Studies of Disease,"
Journal of the National Cancer Institute,
22, 719-748.

21.  Mantel, N. (1963), "Chi-square Tests with
One  Degree  of  Freedom;  Extensions  of  the 
Mantel-Haenszel  Procedure,"  Journal  of  the 
American  Statistical  Association,  58,  690- 
700. 

22.  McDowell,  A.,  Engel,  A.,  Massey,  J.T., 
and  Maurer,  K.  (1981).  "Plan  and  Operation 
of  the  Second  National  Health  and  Nutrition 
Examination  Survey,  1976-80."  Vital  and 
Health  Statistics,  Series  1,  No.  15,  DHHS 
Publication  No.(PHS)  81-1317,  Public  Health 
Service,  Washington:  U.S.  Government 
Printing  Office. 

23.  SAS User's Guide: Statistics (1985),
Version  5  Edition,  Cary,  NC:  SAS  Institute 
Inc. 

24.  Stanish,  W.M.  and  Koch,  G.G.  (1984),  "The 
Use  of  CATMOD  for  Repeated  Measurement 
Analysis  of  Categorical  Data,"  Proceedings 
of  the  Ninth  Annual  SAS  Users  Group 
International  Conference,  Cary,  NC:  SAS 
Institute  Inc. 

25.  Wald,  A.  (1943),  "Tests  of  Statistical 
Hypotheses  Concerning  General  Parameters 
when  the  Number  of  Observations  is  Large," 
Transactions  of  the  American  Mathematical 
Society  54,  426-482. 

26.  Woolf,  B.  (1955),  "On  Estimating  the 
Relationship  between  Blood  Group  and 
Disease,"  Annals  of  Human  Genetics,  19,  251- 
253. 




APPENDIX


*  DEFINE  A  HYPOTHESIS  TESTING  MODULE


* DEFINE A MACRO FOR WEIGHTED LEAST SQUARES ANALYSIS THAT CAN ACCEPT
  DIRECT INPUT OF A FUNCTION VECTOR AND ITS COVARIANCE MATRIX;


MACRO WLS


* READ IN THE VECTOR OF PROPORTIONS AND THE COVARIANCE MATRIX;


PROC IML;                          * INVOKE THE MATRIX PROCEDURE;
USE INPUT; READ ALL INTO IN;       * READ FROM INPUT DATA SET;
Q = NCOL(IN);                      * NUMBER OF PROPORTIONS;
F = IN(|1,|)`;                     * WEIGHTED PROPORTION VECTOR;
VF = IN(|2:Q+1,|);                 * COVARIANCE OF PROPORTIONS;
VFINV = INV(VF);                   * INVERSE COVARIANCE MATRIX;


* READ IN DESIGN MATRIX, AND SET UP FOR WEIGHTED LEAST SQUARES;


USE DESIGN; READ ALL INTO X;       * READ DESIGN MATRIX;
T = NCOL(X);                       * NUMBER OF COLUMNS IN X;
H = X` * VFINV * X;                * CROSSPRODUCT OF X WITH X;
G = X` * VFINV * F;                * CROSSPRODUCT OF X WITH F;
TOT = F` * VFINV * F;              * TOTAL SUM OF SQUARES;
W = (H || G) // (G` || TOT);       * COMPUTE SWEEP MATRIX;


* GET WEIGHTED LEAST SQUARES SOLUTION AND GOODNESS-OF-FIT TEST;


B = SWEEP(W,1:T);                  * SWEEP TO SOLVE THE EQUATIONS;
BETA = B(|1:T,T+1|);               * VECTOR OF ESTIMATED PARAMETERS;
VBETA = B(|1:T,1:T|);              * COVARIANCE MATRIX OF BETA;
SEBETA = SQRT(VECDIAG(VBETA));     * STANDARD ERRORS OF BETA;
RES = B(|T+1,T+1|);                * RESIDUAL SUM OF SQUARES;
RANK = SUM(VECDIAG(VBETA)^=0);     * COMPUTE DF FOR CHI-SQUARE;
DFRES = Q - RANK;                  * RESIDUAL DEGREES OF FREEDOM;
FHAT = X * BETA;                   * PREDICTED PROPORTIONS;
VFHAT = X * VBETA * X`;            * COVARIANCE MATRIX OF FHAT;
SEFHAT = SQRT(VECDIAG(VFHAT));     * STANDARD ERRORS OF FHAT;
START;                             * START IF-THEN-ELSE MODULE;
IF DFRES = 0 THEN PVAL = ...;      * SPECIAL CASE--SATURATED MODEL;
ELSE PVAL = 1-PROBCHI(RES,DFRES);  * COMPUTE CORRESPONDING P-VALUE;
FINISH; RUN;                       * FINISH IF-THEN-ELSE MODULE;


* PRINT INPUT DATA, DESIGN MATRIX, AND ESTIMATED PARAMETERS;


RESET NONAME;                      * NO NAMES ON MATRICES;
PRINT "OBSERVED FUNCTIONS";        PRINT F;
PRINT " "; PRINT " ";
PRINT "COVARIANCE OF FUNCTIONS";   PRINT VF(|FORMAT=E12.|);
PRINT " "; PRINT " ";
PRINT "DESIGN MATRIX";             PRINT X(|FORMAT=BEST8.|);
PRINT " "; PRINT " ";
PRINT "ESTIMATED PARAMETERS (B)";  PRINT BETA;
PRINT " "; PRINT " ";
PRINT "ESTIMATED STANDARD ERRORS OF B";  PRINT SEBETA;
PRINT " "; PRINT " ";
PRINT "PREDICTED FUNCTIONS";       PRINT FHAT;
PRINT " "; PRINT " ";
PRINT "ESTIMATED STANDARD ERRORS OF PREDICTED FUNCTIONS";
PRINT SEFHAT; PRINT " "; PRINT " ";


START TESTS;                       * START DEFINING THE TEST MODULE;
DO I = 1 TO NROW(CM);              * LOOP THROUGH THE MATRICES;
N = CM(|I,1|);                     * NUMBER OF ROWS IN C MATRIX;
C = CM(|I:I+N-1,2:T+1|);           * SET UP THE C MATRIX;
C = BLOCK(C,1);                    * AUGMENT THE C MATRIX;
WC = C * B * C`;                   * COMPUTE THE SWEEP MATRIX;
WC(|N+1,N+1|) = 0;                 * ZERO LOWER RT-HAND ELEMENT;
EC = WC(|1:N,N+1|);                * COMPUTE ESTIMATED CONTRAST;
VEC = WC(|1:N,1:N|);               * COVARIANCE MATRIX OF EC;


* COMPUTE THE CHI-SQUARE TEST STATISTIC FOR THIS HYPOTHESIS;


B2 = SWEEP(WC,1:N);                * SWEEP THE MATRIX;
QC = B2(|N+1,N+1|);                * COMPUTE TEST STATISTIC;
RANK = SUM(VECDIAG(B2)^=0)-1;      * COMPUTE DF FOR CHI-SQUARE;
PVAL = 1 - PROBCHI(QC,RANK);       * COMPUTE CORRESPONDING P-VALUE;
IF PVAL<.0001 THEN PVAL = .0001;   * ADJUST FOR ROUNDOFF ERROR;


* ACCUMULATE THE RESULTS IN THE ANOVA TABLE;


R = R || TITLE(|I|);                     * STORE LABEL IN ROWNAME MATRIX;
SOURCE = SOURCE // (RANK || QC || PVAL); * STORE RESULTS IN ANOVA TABLE;
I = I + N - 1;                           * INCREMENT TO GET NEXT CONTRAST;
END;                                     * END PROCESSING OF CONTRAST;
FINISH;                                  * FINISH DEFINING TEST MODULE;


* COMPUTE HYPOTHESIS TESTS AND SET UP ANOVA TABLE;


USE TEST;                              * MAKE TEST THE CURRENT DATASET;
READ ALL INTO CM(|ROWNAME=TITLE|);     * READ IN ALL C MATRICES;
SOURCE = 1:3;                          * INITIALIZE TABLE;
RUN TESTS;                             * RUN THE HYPOTHESIS TESTS;
R = (R(|,2:NCOL(R)|))`;                * DELETE FIRST ELEMENT OF R;
SOURCE = SOURCE(|2:NROW(SOURCE),|);    * DELETE FIRST ROW OF TABLE;
DF2 = SOURCE(|,1|);                    * COLUMN FOR DEGREES OF FREEDOM;
CH2 = SOURCE(|,2|);                    * COLUMN FOR CHI-SQUARE;
PR2 = SOURCE(|,3|);                    * COLUMN FOR P-VALUE;


* PRINT ANALYSIS OF VARIANCE AND OTHER HYPOTHESIS TESTS;


PRINT " "; PRINT " ";
PRINT "ANALYSIS OF VARIANCE TABLE";
PRINT R(|COLNAME=SRT|) DF2(|COLNAME=DFT FORMAT=BEST5.|)
      CH2(|COLNAME=CHT FORMAT=8.4|) PR2(|COLNAME=PRT FORMAT=6.4|);


* PRINT INPUT DATA AND RESULTS OF MODEL FITTING;

SRT = "SOURCE ";                   * SOURCE OF VARIATION;
DFT = "     DF";                   * DEGREES OF FREEDOM;
CHT = "CHI-SQR";                   * CHI-SQUARE STATISTIC;
PRT = "   PROB";                   * P-VALUE;
R = 'RESIDUAL';
START;
IF DFRES ^= 0 THEN DO;
   PRINT "GOODNESS-OF-FIT TEST";
   PRINT R(|COLNAME=SRT|) DFRES(|COLNAME=DFT FORMAT=BEST4.|)
         RES(|COLNAME=CHT FORMAT=8.4|) PVAL(|COLNAME=PRT FORMAT=6.4|);
END;
ELSE PRINT "GOODNESS-OF-FIT TEST PERFECT -- SATURATED MODEL";
FINISH;  RUN 
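
The weighted least squares step in the listing above can also be sketched outside SAS. The following Python version is only an illustration under stated assumptions: the function names (`transpose`, `matmul`, `inverse`, `wls`) and the toy data are hypothetical, and it uses explicit matrix inversion rather than the SWEEP operator of the listing, so it assumes a design matrix of full column rank.

```python
from math import sqrt


def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def wls(f, vf, x):
    """Weighted least squares fit of the function vector f (covariance vf) to design x.

    Returns (beta, se_beta, fhat, chi2, df), where chi2 is the residual
    goodness-of-fit statistic on df = len(f) - number of columns of x.
    """
    vinv = inverse(vf)
    xtv = matmul(transpose(x), vinv)            # X' V^-1
    h = matmul(xtv, x)                          # X' V^-1 X
    g = matmul(xtv, [[v] for v in f])           # X' V^-1 F
    vbeta = inverse(h)                          # covariance matrix of beta
    beta = [row[0] for row in matmul(vbeta, g)]
    se_beta = [sqrt(vbeta[i][i]) for i in range(len(beta))]
    fhat = [sum(c * b for c, b in zip(row, beta)) for row in x]
    resid = [fi - fh for fi, fh in zip(f, fhat)]
    chi2 = sum(ri * sum(v * rj for v, rj in zip(vrow, resid))
               for ri, vrow in zip(resid, vinv))
    return beta, se_beta, fhat, chi2, len(f) - len(beta)


# Hypothetical toy data: three functions fitted by an intercept and a linear trend.
f = [0.2, 0.4, 0.6]
vf = [[0.01, 0, 0], [0, 0.01, 0], [0, 0, 0.01]]
x = [[1, 0], [1, 1], [1, 2]]
beta, se_beta, fhat, chi2, df = wls(f, vf, x)
print(beta, se_beta, chi2, df)
```

The residual statistic computed here is the quadratic form (F - Xb)' V^-1 (F - Xb), which is algebraically the same quantity that the listing reads from the lower right-hand element of the swept matrix.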


CATEGORICAL  DATA  ANALYSIS  IN  BMDP:  PRESENT  AND  FUTURE


Morton  B.  Brown 

Department  of  Biostatistics 
University  of  Michigan 
Ann  Arbor,  Michigan  48109 


The BMDP series of statistical computer programs currently contains two programs
for categorical data analysis. One (P4F) enables the user to analyze two-way
frequency tables by various statistics, including measures of association and of
prediction, or multiway tables by fitting hierarchical log-linear models. The other
(PLR) can be used to fit logistic models to data using arbitrary design matrices,
provided the response variable is dichotomous. Both programs have features to build
models in a sequential fashion, such as in a stepwise manner.

The development of P4F and its precursors is described in relation to the evolving
methodology of analyzing two-way and multiway frequency tables. Issues of
computational accuracy are contrasted with those of statistical validity.

A new program for categorical data analysis is being developed. Its features
include an ability to fit linear, log-linear and logistic models. The specification
of the models will be either by macro-level keywords or by design matrices. Both
ordinal and nominal variables can be used in the models. The models will be fitted by
either weighted least squares, iteratively reweighted least squares or iterative
proportional fitting. Methods for semi-automatic model-building will be included.


1. INTRODUCTION

The availability of computer software for
the analysis of data summarized as frequency
tables has changed dramatically within the last
decade. Prior to 1975 the major software
packages only computed statistics for two-way
tables, and these were limited to tests for
independence (the chi-squared test and Fisher's
exact test) and related statistics.


The first major package to provide more
general methods to analyze contingency tables
was BMDP [9]. Its initial program for frequency
table analysis, P1F, was a conversion of a
program BMD02S from the earlier Biomedical
Computer Programs [8]. In the next six years
programs were added and several (including P1F)
were made obsolete by the development of P4F
(see Table 1).

P1F incorporated measures of association and
of optimal prediction for two-way tables, but
otherwise remained unchanged from BMD02S. P2F
was added to allow models of quasi-independence
in the two-way table. Included in P2F were
stepwise algorithms for the identification of
extreme cells [3]. The third program, P3F, was
developed to fit log-linear models to data in
multiway contingency tables using an iterative
proportional fitting algorithm [14]. Since BMDP
was not an interactive package, the user needed
an easy way to identify the subset of models
that should be fitted to the data. This led to
tests of marginal and partial association [2,5].


Table 1: The development of computer programs
for the analysis of frequency tables.


BMD02S: CONTINGENCY TABLE ANALYSIS

TWO-WAY CONTINGENCY TABLES

TWO-WAY CONTINGENCY TABLES -- MEASURES OF ASSOCIATION

TWO-WAY CONTINGENCY TABLES -- EMPTY CELLS AND DEPARTURES
FROM INDEPENDENCE

MULTIWAY FREQUENCY TABLES -- THE LOG-LINEAR MODEL

STEPWISE LOGISTIC REGRESSION

P4F: TWO-WAY AND MULTIWAY FREQUENCY
TABLES -- MEASURES OF ASSOCIATION
AND THE LOG-LINEAR MODEL
(COMPLETE AND INCOMPLETE TABLES)

Support of P1F, P2F and P3F was discontinued
when P4F was released.


In 1981 P4F was released [10]. P4F combined
the strengths of the previous programs (P1F, P2F
and P3F) into a single program. In addition to
the features described above, it included a more
flexible manner of identifying structural zeros,
a stepwise algorithm for model selection,
methods to identify extreme cells or strata and
the Mantel-Haenszel statistic when a set of 2x2
tables is analyzed. Since its release we have
made corrections that affect the computations
for data in sparse tables [2] and in tables with
structural zeros.

2. THE ANALYSIS OF TWO-WAY TABLES

The first version of P1F included many
measures of association (or correlation) and
prediction. In retrospect, these measures and
their standard errors were computed without
considering the implications of the sampling
framework. For example, the estimate of the
standard error of the correlation coefficient
used a formula that assumed that the data were
normally distributed instead of summarized in a
contingency table.

Brown  and  Benedetti  [6]  studied  various 
approximations  for  the  standard  errors  of 
measures  of  correlation  and  association  for  data 
summarized as contingency tables. Using the
delta  method  [12,13],  they  derived  asymptotic 
standard  error  formulas  for  the  product-moment 
correlation  and  Spearman  rank  correlation.  In 
addition,  they  found  a  modification  that 
appeared  to  be  less  optimistic  when  used  to  test 
the  null  hypothesis  that  the  correlation  or 
association  is  zero. 

Brown  and  Benedetti  (unpublished)  used  the 
same  type  of  expansion  to  derive  formulas  for 
the  asymptotic  standard  errors  of  measures  of 
prediction  under  the  null  hypothesis  and  added 
these  formulas  to  the  program  in  1977,  but 
unfortunately  the  small-sample  behaviors  of 
these  statistics  were  not  checked  by  simulation 
at  that  time.  After  simulations  showed  that  the 
test  statistics  did  not  have  reasonable 
empirical  sizes  under  the  null  distribution, 
these  asymptotic  standard  error  formulas  for 
predictive  measures  were  eliminated  in  1981. 


Table 2 presents an example of a two-way
frequency table from the first version of the
BMDP manual. The data in this table are
reanalyzed by the current program P4F.
Statistics printed by P1F and/or P4F are listed
in Table 3. As can be seen from the table, some
statistics, primarily those involving standard
errors, have changed since the initial release of
P1F. The date of the change is indicated.

The only statistics modified were the
uncertainty coefficients. In deriving standard
errors for these coefficients, Brown [4] noted
that the coefficients were not normalized to lie
in the range from zero to one. The asymmetric
coefficient was unbounded, whereas the symmetric
coefficient could not exceed one-half.
Modifications in these coefficients were made to
normalize them to lie in the range from zero to

Table 3: A comparison of statistics
produced by P1F and P4F.

Unchanged:

CHI-SQUARE                 MAXIMUM LIKELIHOOD CHI-SQUARE
PHI                        CONTINGENCY COEFFICIENT C
CRAMER'S V                 YULE'S Q AND Y
CROSS-PRODUCT RATIO
FISHER'S EXACT TEST (1-TAIL and 2-TAIL)
MCNEMAR'S TEST OF SYMMETRY

Added:

TETRACHORIC CORR (added 1977)
RELATIVE RISK -- MANTEL-HAENSZEL (added 1981)
KAPPA (added 1982)

Changed:

A) ASSOCIATION AND CORRELATION

                    VALUE          ASE            VAL/ASEO
Date of release:                1975    1977     1975     1977

CORRELATION         -.374       .101    .082    -3.72    -4.35
SPEARMAN CORR       -.422       .098    .087    -4.29    -4.94
GAMMA               -.478       .100     *      -4.80    -4.81
KENDALL TAU-B       -.344       .098    .073    -4.71    -4.81
STUART TAU-C        -.355       .074     *      -4.81      *
SOMERS D            -.384       .085     *      -4.54    -4.81
                    -.307       .063     *      -4.85    -4.81

Table 2: Example of a two-way frequency table
from Dixon ([9], page 293)

CELL FREQUENCY COUNTS

                        SECTION
ATTITUDE      DR. A    DR. B    DR. C    TOTAL

WORSE             1        1       11       13
WORSE-NC          1        0       10       11
NOCHANGE          8        4       16       28
NC-BETTR         11        7        5       23
BETTER            1        8        3       12

TOTAL            22       20       45       87

B) OPTIMAL PREDICTION AND UNCERTAINTY

                       VALUE            ASE           VAL/ASEO
Date of release:    1975    1977    1975    1977    1975    1981

LAMBDA-SYM          .178     *      .089     *      2.00    N/A
LAMBDA-ASYM         .119     *      .088     *      1.35    N/A
LAMBDA-*-ASYM       .144    .179    .099    .075    1.46    N/A
TAU-ASYM            .094     *      .030     *      3.13    N/A
UNCERTAIN-SYM       .082    .164    .0      .046    0.      N/A
UNCERTAIN-ASYM     1.912    .137    .830    .039    2.30    N/A

* unchanged     N/A no longer printed



one. (The change in lambda-star was due to an
error in programming.)

Once Brown and Benedetti [6] derived
improved  estimators  for  the  standard  errors 
under  the  null  hypothesis  of  the  measures  of 
association  and  correlation,  we  included  two 
different  standard  errors  (ASE  and  ASEO)  for 
each  statistic.  Under  the  heading  ASE  is  the 
asymptotic  standard  error  to  be  used  in  building 
confidence  intervals  for  the  expected  value  of 
the  statistic.  A  test  of  the  hypothesis  that 
the  expected  value  of  the  statistic  is  zero  is 
given  by  the  ratio  of  the  statistic  to  its 
asymptotic  standard  error  under  the  null 
hypothesis  (ASEO) ;  this  ratio  is  printed  under 
the  heading  VAL/ASEO. 
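
The distinction between the two standard errors can be illustrated with a short Python sketch. The function names are hypothetical, the 1.96 multiplier assumes a nominal 95% interval under asymptotic normality, and the ASEO value of .086 is not printed in Table 3: it is back-computed here from the tabled correlation (-.374) and its VAL/ASEO ratio (-4.35), so it should be read as an assumption.

```python
def confidence_interval(value, ase, z=1.96):
    """Approximate confidence interval built from the asymptotic standard error (ASE)."""
    return value - z * ase, value + z * ase


def null_test_ratio(value, ase0):
    """Ratio printed as VAL/ASEO: the statistic divided by its standard error under the null."""
    return value / ase0


# Correlation from Table 3 with its 1977 ASE; ASEO is back-computed (assumption).
value, ase, ase0 = -0.374, 0.082, 0.086
low, high = confidence_interval(value, ase)
ratio = null_test_ratio(value, ase0)
print(round(low, 3), round(high, 3), round(ratio, 2))
```

The interval uses ASE because it must be valid whatever the true value of the statistic is, whereas the test ratio uses ASEO because the reference distribution is the one that holds when the null hypothesis is true.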

The above history raises several issues.
The changes in the formulas occurred as a result
of work by Benedetti and myself. Some packages
avoid the problem by not including standard
errors while others use formulas that are
inappropriate for the sampling framework. The
casual user of a statistical program does not
have the ability to evaluate the quality or
source of approximations used within a program,
especially when asymptotic expansions are
involved. Also, it is difficult to check
whether formulas are correctly implemented.
Although now there are more journals that will
accept articles that evaluate the quality of
approximations or compare programs, these
articles are not read widely by the community
that uses these programs for analysis. What are
the program developers' responsibilities to the
research community that uses and trusts the
software developed?

Table 4: Some capabilities of P4F.

FORMS OF INPUT:
   CASEWISE
   AS CELL FREQUENCIES
   AS A MULTIWAY TABLE

TWO-WAY COMPLETE TABLE:
   ALL STATISTICS DESCRIBED ABOVE

TWO-WAY INCOMPLETE TABLE:
   MODELS OF QUASI-INDEPENDENCE
   IDENTIFICATION OF EXTREME CELLS

MULTIWAY TABLES:
   LOG-LINEAR MODELS
   MODEL SCREENING AND BUILDING
   IDENTIFICATION OF EXTREME CELLS
   IDENTIFICATION OF EXTREME STRATA
   SPECIFICATION OF STRUCTURAL ZEROS
   SPECIFICATION OF INITIAL FIT MATRIX
   PARAMETER ESTIMATION OF LOG-LINEAR MODELS
   STD ERRORS FOR THE PARAMETER ESTIMATES
   COVARIANCE MATRIX OF PARAMETER ESTIMATES
   CELL DEVIATES


3. THE CAPABILITIES OF P4F

The program P4F was planned to replace all
the categorical programs previously developed
(P1F, P2F and P3F). Many of the capabilities of
P4F are listed in Table 4.

Since P4F can be used to fit log-linear
models to multiway frequency tables, it is often
used to analyze or reanalyze data that are
already summarized in a (multiway) frequency
table. Therefore, three methods of input are
acceptable: raw data in a case-by-variable
format, processed data as cell indices and
frequencies, and final data summarized as cell
counts in a frequency table.

All the statistics for the two-way table
were carried over from P1F to P4F. The
Mantel-Haenszel and kappa statistics were added.

A major goal for the development of P4F was
to make available to a wide audience the ability
to describe the relationships among the factors
of a multiway frequency table by log-linear
models. There was a need to provide an easy
manner to specify models and to identify
possible models.

Log-linear models are specified by listing
the factors or interactions in the minimal
configuration. If a redundant list is provided,
the extra terms will be ignored. All models are
assumed to be hierarchical. That is, if a
higher-order interaction is specified, all
lower-order interactions and main effects that
are specified by subsets of the interaction are
automatically included in the model.
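
The hierarchy rule just described can be sketched in a few lines of Python. This is only an illustration: the function name `hierarchical_terms` and the representation of a term as a tuple of factor names are assumptions made for the example and are not the P4F model-specification syntax.

```python
from itertools import combinations


def hierarchical_terms(configuration):
    """Expand a list of terms into every effect implied by the hierarchy principle.

    Each term is a tuple of factor names; every non-empty subset of a term
    (main effects and lower-order interactions) is included in the result.
    """
    terms = set()
    for term in configuration:
        factors = sorted(term)
        for size in range(1, len(factors) + 1):
            for subset in combinations(factors, size):
                terms.add(subset)
    # Order by interaction order, then alphabetically, for readable output.
    return sorted(terms, key=lambda t: (len(t), t))


minimal = [("A", "B"), ("C",)]
redundant = [("A", "B"), ("A",), ("C",)]
print(hierarchical_terms(minimal))
print(hierarchical_terms(minimal) == hierarchical_terms(redundant))
```

In this sketch a redundant list produces the same expanded model as the minimal configuration, which mirrors the statement that extra terms are ignored.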

Since there are many possible log-linear
models when a table is multidimensional, it was
necessary to include some methods that aid in
the identification of models. When the table is
two- or three-way it is possible to enumerate
and evaluate all the possible hierarchical
models at a reasonable cost and time. However,
for four-way and higher tables it is necessary
to screen the interactions for those likely to
contribute to the final model. Brown [5] (see
also [2]) proposed using tests of marginal and
partial association to screen the interactions.
These tests are computed by P4F when the
appropriate keyword is specified.

In addition, the user can request that
effects and/or interactions be added or deleted
from a base model in a stepwise manner. This
option is very useful when used in conjunction
with the tests of marginal and partial
association. The tests are used to screen for a
starting (base) model and then the stepwise
procedure is used to evaluate the effect of
adding or deleting terms from the model.

The user can identify cells that are to be
treated as structural zeros; these cells are
excluded from all analyses. Brown [3] presented
an algorithm to identify extreme cells
(outliers) such that at each step the most
extreme cell was eliminated and treated
thereafter as a structural zero. To evaluate
the influence of these extreme cells, the
expected values of these cells were estimated
from the log-linear model fitted to all cells as
yet not eliminated and not defined as structural
zeros. In P4F each cell defined as a structural
zero will have its expected value estimated in
the manner described for eliminated cells. This
is similar to the calculation of deleted
residuals in regression.

The usual manner in which the parameters of
the log-linear model are estimated within P4F is
by applying the ANOVA formulas to the logarithms
of the estimated expected values. This solution
is not possible when either structural zeros are
specified or at least one of the marginal cells
in a configuration of the model is zero; i.e.,
there are zero expected values. In either of
these situations P4F forms a variance-covariance
matrix and estimates the parameters by sweeping
(or partially sweeping) this matrix. This
procedure will give correct estimates, although
the solution may no longer be unique; i.e., the
problem may be overparameterized [7].

Some of the limitations of P4F are described
in Table 5.

Sparse data in contingency tables can cause
problems of numerical accuracy and of
statistical interpretation. A sparse table is
one in which there are many cells with small
expected values and one or more observed zeros.
When the pattern of observed zeros creates zeros
in a marginal subtable corresponding to one of
the configurations in the model, there can be
numerical problems in the estimation of
parameters, of expected values and of degrees of
freedom [7]. Care in implementations of the
algorithms can alleviate some of the numerical
problems, but cannot guarantee their absence.
Overparameterized models with nonestimable
parameters can occur.


Table 5: Known problems and limitations of P4F

SPARSE TABLES:
   WHEN MARGINAL ZEROS OCCUR, TWO MODELS BEING
   COMPARED MAY DIFFER IN THEIR SETS OF CELLS
   WITH FITTED VALUES EQUAL TO ZERO

   STD ERRORS MUST BE OBTAINED BY INVERTING
   INFORMATION MATRIX -- MAY REQUIRE TOO MUCH
   MEMORY

NONHIERARCHICAL MODELS:
   CANNOT BE FITTED

ORDINAL CATEGORICAL VARIABLES:
   CANNOT BE TAKEN INTO ACCOUNT
   (EXCEPT FOR MEASURES OF
   ASSOCIATION IN TWO-WAY TABLE)


The small expected values affect the
distribution theory of the statistics. The
distribution theory underlying the chi-square
statistics is large-sample asymptotic theory,
which is inappropriate for statistics based on
sparse tables. Also, when the model is
overparameterized, the computer program will
print out a solution, but there are many other
equally good alternate solutions with differing
parameter estimates. One approach often used is
to augment each cell by a constant. Although
this approach eliminates the numerical problems,
it leaves the problems of inference untouched.

P4F uses an iterative proportional fitting
algorithm to estimate the expected values of a
log-linear model, which restricts the models that
can be specified and fitted. For example, all
models must be hierarchical. In addition,
models that incorporate the ordering of indices,
such as those described by Agresti [1], are not
available.
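
As an illustration of why iterative proportional fitting is tied to hierarchical models, the following Python sketch rescales the fitted values so that they match the observed margin for each configuration in turn. It is a minimal sketch and not the P4F implementation: the function name `ipf`, the dictionary representation of the table (index tuples mapped to counts), and the convergence settings are all assumptions, and structural zeros are not handled.

```python
def ipf(observed, configs, tol=1e-8, max_iter=100):
    """Fit expected values by iterative proportional fitting.

    observed: dict mapping a cell index tuple to its observed count.
    configs:  list of tuples of axis positions, one per fitted margin,
              e.g. [(0,), (1,)] for independence in a two-way table.
    """
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(max_iter):
        max_change = 0.0
        for config in configs:
            obs_margin, fit_margin = {}, {}
            for cell, count in observed.items():
                key = tuple(cell[axis] for axis in config)
                obs_margin[key] = obs_margin.get(key, 0.0) + count
                fit_margin[key] = fit_margin.get(key, 0.0) + fitted[cell]
            for cell in fitted:
                key = tuple(cell[axis] for axis in config)
                if fit_margin[key] > 0:
                    new = fitted[cell] * obs_margin[key] / fit_margin[key]
                else:
                    new = 0.0
                max_change = max(max_change, abs(new - fitted[cell]))
                fitted[cell] = new
        if max_change < tol:
            break
    return fitted


# Hypothetical 2x2 table fitted under the independence model (row and column margins).
observed = {(0, 0): 10, (0, 1): 20, (1, 0): 30, (1, 1): 40}
fitted = ipf(observed, [(0,), (1,)])
print(fitted)
```

Because the only inputs to each adjustment are observed marginal totals, a model that cannot be expressed as a set of fitted margins, such as a nonhierarchical model or one with scores for ordered categories, falls outside what this scheme can fit.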


4. DESIGNING A NEW PROGRAM

Given the rapid strides in developing new
models for categorical data, it is necessary to
develop more flexible computer programs that
will allow the fitting of such models.

Some general goals for a program are:

1) To make available new statistical
methodology. For example, Goodman and Kruskal
[11,12,13] proposed statistics, such as the
gamma, lambda and tau, to estimate relationships
among the indices in the two-way frequency
table. Others have proposed alternate measures.
As long as these measures did not appear in
computer programs, it was difficult to evaluate
their usefulness. To interpret the
meaningfulness of the statistics, it is
necessary to compute their standard errors and
z-scores.

2) To provide aids for the unsophisticated
user. For example, special purpose programs to
fit log-linear models (ECTA, GLIM, etc.) assume
that the user knows which model is to be fitted
to the data based on an a priori knowledge of
the variables. Identification of the
appropriate model was made by testing effects in
the model or by a stepwise procedure. The
rationale behind tests of marginal and partial
association in P4F [2,5] is to enable the
investigator to screen all the possible
interactions for their 'maximal' effect and thus
order them in importance.

3) To be easy for a novice to use. This
last consideration is critical when planning a
new program. For example, how should models be
specified in the general case where the model
may be nonhierarchical or when the factors are
ordinal or when the dependent variable is
ordinal?


When  the  only  programs  available  analyzed 
data  in  two-way  tables  and  the  only  statistic 
computed  was  the  chi-square,  it  was  reasonable 
to  assume  that,  if  the  user  can  run  the  program, 
s/he  can  understand  the  output.  When  there  is  a 
program  such  as  P4F  with  a  relatively  simple 
means  to  specify  options,  users  can  request 
options  that  produce  results  which  they  are  not 
trained  to  interpret  correctly.  When  planning  a 
new  program  that  starts  where  P4F  stops,  which 
audience  should  be  addressed: 

-- the unsophisticated user in an applied area,
-- the sophisticated user in the applied area,
-- the statistician with a master's degree, or
-- the advanced practitioner of statistics?

A  requirement  to  specify  design  matrices 
explicitly  would  indicate  that  the  last  group 
is  the  target  audience.  The  presence  of  a 
totally  automatic  model  search  routine  would 
allow  all  groups  to  use  the  program  and  possibly 
not  understand  the  results.  Therefore,  there  is 
a  need  to  allow  different  levels  of 
sophistication  of  usage,  where  users  at  the 
lowest  level  would  not  need  access  to  all  the 
options  (and  probably  would  not  desire  the 
excluded  options). 

Models that are not hierarchical, such as
those of marginal symmetry, cannot be fitted
within P4F. In addition, the internal structure
between cells cannot be specified to P4F.
Therefore, when repeated observations are taken
on a variable and each repetition is not treated
as a separate index, P4F is unable to analyze
the data.

Several forms of models have been proposed
for categorical data. The two most commonly
used at this time are the log-linear model where

ln p = linear model

and the logistic regression model where

ln [p/(1-p)] = linear model.

Alternative models include writing on the
left-hand side either p or the odds-ratio or some
other function of one or more p's.
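
The two left-hand sides above can be written as small functions. This is only an illustrative sketch; the function names are hypothetical and are not part of P4F or of the program being planned.

```python
import math

def log_link(p):
    """Left-hand side of the log-linear model: ln p."""
    return math.log(p)

def logit_link(p):
    """Left-hand side of the logistic regression model: ln[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Map a value of the linear model back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))
```

For instance, `logit_link(0.5)` is zero, and `inverse_logit` undoes `logit_link` for any p strictly between 0 and 1.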

When the independent variables, or factors,
are not ordered, the usual representation of the
linear model is the same as that of an analysis
of variance model. The only difference is that
in the log-linear model the logarithm of the
expected value, and not the expected value
itself, has a linear form. When one or more
factors are ordered, it may be possible to write
the linear model using a reduced set of
variables (such as the lower-order terms of an
orthogonal decomposition) for that factor, or
the models of Agresti [1].

Classically, statistics and biostatistics
have been concerned with fitting models to data
such that the deviations of the observations
from the model are mutually independent. More
recently, models have been developed to allow
for repeated observations from individuals. In
these models it is recognized that the repeated
observations from an individual have less
variation than a similar set of observations,
each obtained from a different individual.
Repeated measures models for categorical data
have primarily treated the situation when there
is a single response variable, such as voting
preference, observed over time for a group of
individuals. The models that are fitted to the
data, and the hypotheses tested, describe change
over time. The new program will be able to fit
general models for repeated measures.

Several methods of fitting the log-linear
model to categorical data will be available:

i) Maximum likelihood (ML) using the
iterative proportional fitting algorithm (IPF).
This method is limited to fitting hierarchical
models.

ii) ML using a Newton-Raphson algorithm (NR).
This method may require computing a large
covariance matrix at each iteration.

iii) Weighted least squares (WLS). These
estimates are not maximum likelihood. The
method does not require iteration but the same
covariance matrix is needed as for the NR
algorithm.
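
As an illustration of method i), the following is a minimal sketch of iterative proportional fitting for one hierarchical model, the model of no three-factor interaction in a three-way table. It is not the code of P4F or of the new program; the function name, the nested-list table layout, and the convergence rule are assumptions made for the example.

```python
def ipf_no_three_factor(n, tol=1e-10, max_iter=1000):
    """Fit the model with all two-way margins fixed to a table n[a][b][c]."""
    A, B, C = len(n), len(n[0]), len(n[0][0])
    m = [[[1.0] * C for _ in range(B)] for _ in range(A)]
    cells = [(a, b, c) for a in range(A) for b in range(B) for c in range(C)]
    # Each fitted margin is identified by the index positions it keeps.
    margins = [(0, 1), (0, 2), (1, 2)]
    for _ in range(max_iter):
        biggest_change = 0.0
        for keep in margins:
            obs, fit = {}, {}
            for cell in cells:
                a, b, c = cell
                key = tuple(cell[i] for i in keep)
                obs[key] = obs.get(key, 0.0) + n[a][b][c]
                fit[key] = fit.get(key, 0.0) + m[a][b][c]
            for cell in cells:
                a, b, c = cell
                key = tuple(cell[i] for i in keep)
                # Rescale so the fitted margin matches the observed margin.
                new = m[a][b][c] * obs[key] / fit[key] if fit[key] > 0 else 0.0
                biggest_change = max(biggest_change, abs(new - m[a][b][c]))
                m[a][b][c] = new
        if biggest_change < tol:
            break
    return m


# Hypothetical 2 x 2 x 2 table of counts, used only to exercise the sketch.
example_table = [[[10, 20], [30, 40]], [[15, 5], [25, 35]]]
example_fit = ipf_no_three_factor(example_table)
```

At convergence the fitted table reproduces every observed two-way margin, which is what restricts the algorithm to hierarchical models defined by fitted marginals.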

Table  6  summarizes  many  of  the  attributes  of 
the  program  that  is  being  developed. 


Table  6:  Attributes  of  the  new  program. 

MODELS  THAT  CAN  BE  FITTED: 

LINEAR 

LOG-LINEAR 

LOGISTIC 

MODEL  SPECIFICATION  BY: 

MACRO-LEVEL  KEYWORDS 
DESIGN  MATRICES 

VARIABLES  CAN  BE: 

NOMINAL 

ORDINAL 

ALGORITHMS: 

ML  USING  IPF
ML  USING  IRWLS 
WEIGHTED  LEAST  SQUARES 

MODEL-BUILDING: 

SEMI-AUTOMATIC 

INTERACTIVE 

TYPES  OF  MODELS: 

POISSON 
MULTINOMIAL 
REPEATED  MEASURES 


LOG-LINEAR MODELING WITH SPSS


Clifford C. Clogg and Mark P. Becker


The Pennsylvania State University

University Park, Pennsylvania 16802

The recently released software package SPSS-X contains two procedures for log-linear
analysis of contingency tables, LOGLINEAR and HILOGLINEAR. LOGLINEAR is based on
Haberman's (1979) program FREQ, and it uses a Newton-Raphson algorithm for calculating
maximum likelihood estimates. LOGLINEAR is probably the most general computer program
for log-linear analysis now included in major software packages. HILOGLINEAR is based
on the iterative-proportional-fitting (IPF) algorithm and is restricted to hierarchical
models that can be expressed in terms of fitted marginals. We evaluate these two
procedures according to the following criteria: (1) What can be done with the
procedures? (2) Does the available documentation give a suitable description of those
capabilities? (3) What should SPSS-X have done? (Or, what should they do with these
procedures in the future?) (4) What diagnostics and/or warnings are available or
could be made available given current knowledge?


1.  INTRODUCTION 

In 1979 Haberman introduced a computer program
called FREQ that "can be used to compute maximum
likelihood estimates for any log-linear model"
(Haberman, 1979, p. 571). What he meant was that
his program could be used to obtain ML fits for
any model for contingency tables that is additive
in the logarithms of cell frequencies, when
the cell frequencies arise from Poisson, multinomial,
or product-multinomial sampling schemes.
There were three main advantages of FREQ in
relation to others that existed in the 1970s:

1.  It  calculated  adjusted  (truly  standardized) 
residuals  (cell  by  cell)  and  generalized  adjusted 
residuals  for  contrasts  among  cells. 

2. It allowed for adjustment of Poisson frequencies
for differential cell-by-cell exposures,
thus permitting log-linear analysis of rates of
rare events.

3. The Cholesky factorization of the estimated
information matrix at successive steps in the
Newton-Raphson algorithm was done with great
care, and analysts were thereby alerted to
nonexistence problems and related problems that
arise from sparse data and/or from specifications
of quasi-log-linear models.

The main disadvantage of FREQ was that users had
to supply the model matrix (or design matrix) in
complete detail, a difficulty that prevented its
widespread use.

In 1983 the FREQ program was incorporated in the
LOGLINEAR procedure of SPSS-X. The most obvious
difference between LOGLINEAR and FREQ is that in
the former the model matrix can be created with
only a small number of commands using symbolic
representations for the types of contrasts that
are to be employed. The Kronecker product
operations that build the model matrix from the
variable contrasts are performed automatically. Many
options are available for specifying contrasts,
quantitative covariates may be added to a model
quite easily, logit-type models (or
multinomial-response models) can be readily distinguished
from the wider class of log-linear models for
the cell frequencies, normal probability plots
for residuals can be obtained, and an analysis
of dispersion including asymmetric measures of
association for logit-type models is available.
LOGLINEAR is not designed to be a stand-alone
exploratory analysis procedure. But once the
contingency table (including both the variables
and the categories used for each) and a relatively
small number of models for this table are
specified, LOGLINEAR is probably the best ("most
general") program for log-linear models currently
in existence.
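
To make the Kronecker-product construction concrete, here is a minimal sketch of how a model matrix for three trichotomous variables could be assembled from per-variable contrast matrices. It is not LOGLINEAR's code: the helper names are hypothetical, and the deviation coding used (last level coded -1) is an assumption chosen to agree with "deviations from means" parameters.

```python
def kron(x, y):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_x for b in row_y] for row_x in x for row_y in y]

def deviation_contrasts(levels):
    """levels x (levels - 1) coding; the last level is minus the sum of the others."""
    rows = [[1.0 if i == j else 0.0 for j in range(levels - 1)]
            for i in range(levels - 1)]
    rows.append([-1.0] * (levels - 1))
    return rows

def ones(levels):
    return [[1.0] for _ in range(levels)]

def term_columns(term, levels):
    """Columns for one effect; `term` holds the positions of the variables in it."""
    block = [[1.0]]
    for position, size in enumerate(levels):
        factor = deviation_contrasts(size) if position in term else ones(size)
        block = kron(block, factor)
    return block

def hstack(blocks):
    """Place column blocks side by side."""
    return [sum((b[r] for b in blocks), []) for r in range(len(blocks[0]))]

# Main effects A, B, C and the three two-factor interactions, 3 levels each.
terms = [{0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}]
model_matrix = hstack([term_columns(t, [3, 3, 3]) for t in terms])
```

With three levels per variable this gives 27 rows (cells) and 18 columns: two per main effect and four per two-factor interaction. The constant term is left out of the sketch.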

Below we describe briefly what LOGLINEAR can do,
whether the documentation provides a satisfactory
description of its capabilities, and what could
be done to improve the program in the light of
current knowledge. It is not our purpose to
compare LOGLINEAR with other programs. In our
experience, analysis of contingency tables in
practical research settings usually requires the
use of more than one procedure from more than
one software package. And it should be acknowledged
that computing for contingency table
models is very primitive compared to computing
for linear models. We are a long way from
having computational equipment that is as flexible,
and as believable, as the procedures
REG and GLM in the SAS package. And we are even
further from the development of intelligent software
like the REX program of Bell Laboratories
for regression analysis (Hahn, 1985; Gale and
Pregibon, 1982). Our goal is not to make
invidious comparisons but rather to assess
strengths and weaknesses of the particular program
under review. More borrowing of ideas
among software developers is called for, and we
hope that the present review points to areas
where such borrowing is most likely to be beneficial.

2.  The  General  Log-linear  Model 

LOGLINEAR, like its predecessor FREQ, works with
the following general formulation of the log-linear
model for frequency data. Suppose that
there are J "groups" with the number $n_j$ of
observations per group, $j = 1, \ldots, J$, fixed either by
the sampling scheme or by conditioning. Suppose
further that there are I levels of response,
which may represent crossed or nested combinations
of response variables. Let $n_{ij} \ge 0$ denote
the observed frequency in a given response-group
combination, $m_{ij} = E(n_{ij})$, $w_{ij}$ a fixed "weight",
and $z_{ij}$ a dummy variable taking on the value 0 if the
i-th response in the j-th group is a structural
zero ($m_{ij} = 0$) or is to be fitted perfectly
($m_{ij} = n_{ij}$), and otherwise taking on the value 1.
Finally, let $x_{ijk}$, $1 \le k \le K$, denote the k-th
column of the relevant model matrix, where K is
the number of parameters to be estimated. The
general model, for cells with $z_{ij} = 1$, is

$$\log\left(\frac{m_{ij}}{w_{ij}}\right) = \alpha_j + \sum_{k=1}^{K} \beta_k x_{ijk}. \tag{2.1}$$

Special cases of this model include the following:

I. Log-linear models for complete contingency
tables: $z_{ij} = w_{ij} = 1$, all i and j, J = 1. (All
variables are responses or "dependent" variables.)

II. Log-linear models for incomplete tables
("quasi-log-linear models"): $z_{ij} = 0$ for $(i,j) \in S$,
where S denotes structurally empty response-group
levels, $z_{ij} = 1$ for $(i,j) \in S^c$, J = 1.

III. Multinomial-response models: J > 1 (the
dichotomous response logit model is obtained
when I = 2).

IV. Poisson (or rate) models: $w_{ij}$ = exposure
(e.g., time in months) for rare event count $n_{ij}$.
(Here, $m_{ij}/w_{ij}$ is the rate of the rare event for
the (i,j) combination, and we will usually want
to take J = 1.)

Cell-by-cell residuals, $n_{ij} - \hat{m}_{ij}$, are examined
by comparing them to the estimated asymptotic
standard deviation; generalized residuals compare
linear contrasts of the observed and fitted counts
(contrast coefficients summing to zero); see
Haberman (1973, 1978). Dispersion in multinomial
responses (marginal and conditional) is analyzed
using the entropy and concentration measures
(Haberman, 1982). The program gives estimated
parameter values, chi-squared statistics (Pearson
and likelihood-ratio), the variance-covariance
matrix of parameter estimates (from the
information matrix), correlations obtained from
them, and a variety of output options.


Estimation is by the Newton-Raphson method,
which as programmed is essentially based on
iteratively re-weighted least squares (with
weights that take account of the fixed weights
$w_{ij}$ and the approximations for $m_{ij}$ obtained from
a previous cycle). If $z_{ij} = 0$ for some
response-group combination, or if $n_j = \sum_i n_{ij} = 0$ (no
observations in the j-th group), the procedure
actually eliminates (gives zero weight to) the
given response-group combination, or the responses
in the j-th group, respectively.
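
The estimation step can be sketched for the simplest setting: a single group (J = 1), Poisson counts, and optional fixed weights used as exposures. The code below is a hedged illustration of a Newton-Raphson iteration, not the FREQ or LOGLINEAR implementation. The function names are hypothetical, the first column of the model matrix is assumed to be a constant, and none of the structural-zero or sparse-data handling discussed here is attempted.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, k))
        x[r] = (aug[r][k] - tail) / aug[r][r]
    return x

def fitted(x, w, beta):
    """Expected counts m_i = w_i * exp(sum_k beta_k x_ik)."""
    return [w[i] * math.exp(sum(xi * bk for xi, bk in zip(x[i], beta)))
            for i in range(len(x))]

def fit_poisson_loglinear(counts, x, w=None, tol=1e-10, max_iter=50):
    """Newton-Raphson for log(m_i / w_i) = sum_k beta_k x_ik."""
    cells, k = len(counts), len(x[0])
    w = w or [1.0] * cells
    # Start at the overall rate; assumes column 0 of x is the constant.
    beta = [math.log(sum(counts) / sum(w))] + [0.0] * (k - 1)
    for _ in range(max_iter):
        m = fitted(x, w, beta)
        # Score vector X'(n - m) and information matrix X' diag(m) X.
        score = [sum(x[i][j] * (counts[i] - m[i]) for i in range(cells))
                 for j in range(k)]
        info = [[sum(x[i][a] * m[i] * x[i][b] for i in range(cells))
                 for b in range(k)] for a in range(k)]
        step = solve(info, score)
        beta = [bj + sj for bj, sj in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta, fitted(x, w, beta)
```

Each cycle uses the current fitted counts as weights in the information matrix, which is the sense in which the iteration resembles re-weighted least squares. A singular information matrix, as can arise with sparse data, would make `solve` fail; diagnosing that case is exactly the problem taken up below.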

All analyses of contingency tables based on
frequentist perspectives are plagued by the
problem of sparse data, regardless of the
estimation method used (weighted least squares and
ML being the two most popular methods). It is
useful to distinguish two extreme types of
sparse data:

Type I. One or more of the $n_{ij} = 0$, but $n_j > 0$
for all j, $j = 1, \ldots, J$.

Type II. Some $n_j = 0$, but $n_{ij} > 0$ for all
response-group combinations where $n_j > 0$.
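
A program could report which of these situations it has encountered before fitting begins. The sketch below is one hypothetical way to do so for a response-by-group table; the function name and the returned labels are illustrative, and the "mixed" label (void groups together with zero cells in observed groups) is an addition of the sketch.

```python
def classify_sparseness(table):
    """Classify a table of counts, table[i][j] = response i in group j."""
    n_groups = len(table[0])
    group_totals = [sum(row[j] for row in table) for j in range(n_groups)]
    # Groups with no observations at all.
    void_groups = [j for j, total in enumerate(group_totals) if total == 0]
    # Zero cells inside groups that do have observations.
    zero_cells = [(i, j)
                  for i, row in enumerate(table)
                  for j, count in enumerate(row)
                  if count == 0 and group_totals[j] > 0]
    if not void_groups and not zero_cells:
        return "none"
    if zero_cells and not void_groups:
        return "Type I"
    if void_groups and not zero_cells:
        return "Type II"
    return "mixed"
```

A check of this kind only flags where zeroes occur; whether they actually cause an estimability or rank problem depends on the model being fitted.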

These conditions are specified so that they
pertain to the multinomial-response model, but
similar conditions apply to log-linear models
for the set of cell frequencies, in which case
the condition $n_j = 0$ should be replaced by the
condition that observed values of some sufficient
statistics take on the value zero. A
third case would have some $n_j = 0$ (no responses
for some groups) and some $n_{ij} = 0$ for
response-group combinations that are actually observed
(where $n_j > 0$). To our knowledge, all programs
now in existence give zero weight to responses
in a void group ($n_j = 0$), and estimability may
or may not be affected by this. ML procedures
will check estimates of $m_{ij}$ at each cycle t (say,
$\hat{m}_{ij}(t)$) when sparse data of Type I occur. If
$\hat{m}_{ij}(t) \approx 0$ then most programs will give zero
weight to that response-group combination in
all successive iterations. (Curiously, a
re-examination of the offending (i,j) estimated
count in cycles after the first one where the
problem occurs does not seem to be carried out.)
This effective deletion (fitting a zero expected
count) might lead to a rank problem for the
matrix of the $x_{ijk}$, and when this occurs smart
programs will delete (rather arbitrarily, it
turns out) one or more "columns" of the model
matrix. Most computational problems in ML fitting
arise in sparse data situations: when
$n_{ij} > 0$ for all i and j, there are no problems
at all theoretically (Haberman, 1974), and
computation is straightforward. In our opinion,
the chief computational problem in contingency
table analysis based on ML methods is diagnosing
when sparse data (Type I or Type II) creates an
estimability or rank problem. As we shall see
below, LOGLINEAR can be improved on, although
what it currently does is probably better than
what similar programs do. Diagnostic warnings
concerning such problems, at least intelligible
ones, are virtually nonexistent, not just in
LOGLINEAR but in all other procedures or programs
we have used.

3. Specifying Models in LOGLINEAR


To illustrate the flexibility of LOGLINEAR,
consider the case with three categorical variables
A, B, and C. Examples of models in each general
case (I - IV) described above will be given.

These examples can of course be done in a variety
of ways; we only intend to convey the flavor of
modeling with LOGLINEAR here.


Case I. Log-linear models for contingency tables.
The model of no 3-factor interaction (no
"second-order interaction") can be estimated by the
following two commands:

LOGLINEAR A(1,3) B(1,3) C(1,3)/                        (3.1)

DESIGN = A, B, C, A BY B, A BY C, B BY C/

Each variable is assumed to be trichotomous. The
first statement says that there is "one group"
(J = 1) or equivalently that each variable is a
response. The model matrix is filled with two
columns for the main effects of A, by including
"A" in the DESIGN statement. Four columns are
used for each interaction. The default coding
of variable contrasts leads to parameter estimates
that correspond to deviations from means.
In Goodman's (1970) notation, $\lambda^A_1$ and $\lambda^A_2$ will
be estimated, for example, and $\lambda^A_3$ (which is not
estimated) is given by $-(\lambda^A_1 + \lambda^A_2)$. (An easy
modification of the program would be to include
as an option a feature that would calculate the
redundant parameter estimates as well as their
standard errors.)

Now suppose that the levels of all three variables
are equally spaced, and we wish to examine
the model that has linear-by-linear interaction
structure. The simplest way to do this is to use
orthogonal polynomials to code each variable;
this is done by specifying CONTRAST(A) = POLYNOMIAL,
etc. Then the DESIGN statement is replaced by

DESIGN = A, B, C, A(1) BY B(1), A(1) BY C(1),
         B(1) BY C(1),                                 (3.2)
         A(1) BY B(1) BY C(1)/

The term "A(1)" denotes the linear orthogonal
contrast for A, for example. This model has
linear-by-linear 2-factor interactions and
linear-by-linear-by-linear 3-factor interaction. It is
related to models considered in Haberman (1974),
Goodman (1984), and Clogg (1982).

Case II. Quasi-log-linear models. Suppose that
cells (1,1,1) and (2,2,2) are structural zeroes,
or are to be fitted perfectly because they are
"outliers". Either of the above models can be
examined recognizing the set S of structural
zeroes; this is done by specifying the $z_{ij}$ of
the previous section. The CWEIGHT command in
LOGLINEAR can be used to convey this information
to the program. If Z is the vector with entries
$z_{ij}$ (= 0 for structural zeroes, 1 for others),
then specifying

CWEIGHT = Z/

prior to the DESIGN statement will cause the
program to analyze a quasi-log-linear model. The
quasi-independence model (in three dimensions)
would be specified by

DESIGN = A, B, C/                                      (3.3)

for example, and quasi-log-linear models analogous
to those in (3.1) or (3.2) can be analyzed
as well. LOGLINEAR calculates parameter estimates
for quasi-log-linear models, unlike some
programs based on the iterative-proportional-fitting
algorithm, and if the pattern of blanked
out cells creates rank problems in the model
matrix, the program will recognize the difficulty
and delete one or more parameters from the
model. This should alert the user to potential
problems in interpreting parameter values
(contrasts of log-estimated counts). It essentially
solves the problems in calculating degrees of
freedom for chi-squared statistics when such
problems arise. (The special problem of dealing
with separable subtables created by particular
patterns of structural zeroes, see Goodman (1968),
is solved without difficulty.)

Case III. Multinomial-response (logit-type)
models. Responses are distinguished from "factors"
or independent variables with the BY specification
in the LOGLINEAR command. Suppose A is
the response variable and that B and C are factors
with joint BC levels fixed by sampling
design or conditionally fixed by the researcher's
wish to examine only the "effects" of B and C on
A. Suppose first that we are only interested in
the first two levels of A; perhaps level 3 of A
represents a "don't know" response or censored
observations. The additive dichotomous logit
model is specified by:

LOGLINEAR A(1,2) BY B(1,3) C(1,3)/

DESIGN = A, A BY B, A BY C/                            (3.4)

The "BY" fixes the $n_j$, $j = 1, \ldots, 9$, where
$n_1$ = the sample total with B = 1 and C = 1, ...,
$n_9$ = the sample total with B = 3 and C = 3. This
command essentially determines the values $\alpha_j$ in
(2.1). A model with A trichotomous (perhaps now
including the observations censored in the previous
model) is obtained by replacing "A(1,2)"
with "A(1,3)".


Now suppose that level 3 of A represents a "don't
know" response. The researcher wants to examine
contrasts of A = 1 versus A = 2 taking account of
the censoring that takes place in the model of
(3.4). A natural way to do this exploits the
"special" contrast specification:

CONTRAST(A) = SPECIAL(3*1, 1 -1 0, 1 1 -2)/            (3.5)

The contrast (1, -1, 0) is of special interest,
and the contrast (1, 1, -2) can be used to examine
the difference between non-censored and censored
observations. Now suppose that we wish to
examine linear effects of B and C as in (3.2).
The appropriate model will be estimated by the
following commands:

LOGLINEAR A(1,3) BY B(1,3) C(1,3)/

DESIGN = A, A BY B(1), A BY C(1)/                      (3.6)
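
The arithmetic behind the special contrasts in (3.5) can be checked directly. In the sketch below, "3*1" is read as a row of three ones, which is an assumption about the repetition shorthand, and the counts are hypothetical values used only to show what each contrast measures on the log scale.

```python
import math

# Rows of the SPECIAL matrix in (3.5); "3*1" is read as three ones (assumed).
SPECIAL = [
    [1, 1, 1],    # constant row
    [1, -1, 0],   # A = 1 versus A = 2
    [1, 1, -2],   # levels 1 and 2 versus level 3 (the censored level)
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rows_are_orthogonal(matrix):
    """True when every pair of distinct rows has zero inner product."""
    return all(dot(matrix[r], matrix[s]) == 0
               for r in range(len(matrix))
               for s in range(r + 1, len(matrix)))

def apply_contrasts(matrix, log_counts):
    """Value of each non-constant contrast applied to log-frequencies."""
    return [dot(row, log_counts) for row in matrix[1:]]

# Hypothetical counts for the three levels of A within one B-by-C group.
counts = [40, 20, 10]
values = apply_contrasts(SPECIAL, [math.log(c) for c in counts])
```

The first value is the log-odds of A = 1 against A = 2, and the second compares the two substantive responses with the "don't know" level. Because the rows are mutually orthogonal, the two contrasts do not overlap in the information they use.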

Case IV. Poisson models. Now suppose that A, B,
and C denote risk factors, and the frequencies
in the cross-classification of these risk factors
denote event counts (e.g., deaths). Suppose
further that the cell-by-cell exposures (e.g.,
person months) are collected in a vector W. The
command "CWEIGHT = W" adjusts the cell counts
for the exposures. If each factor is quantitative
with equal spacing, a model of interest
could be:

LOGLINEAR A(1,3) B(1,3) C(1,3)/

CWEIGHT = W/

CONTRAST(A) = POLYNOMIAL/

CONTRAST(B) = POLYNOMIAL/                              (3.7)

CONTRAST(C) = POLYNOMIAL/

DESIGN = A(1), B(1), C(1)/

If $m_{stu}$ is the expected count in cell (s,t,u) and
$w_{stu}$ is the corresponding exposure in the A x B
x C table, the model estimated above is equivalent
to:

$$\log\left(\frac{m_{stu}}{w_{stu}}\right) = \beta_0 + \beta_1 s + \beta_2 t + \beta_3 u,$$

an additive log-rate model with linear effects
of each risk factor. It is very difficult to
estimate such a rate model using the IPF algorithm
advocated in Laird and Olivier (1981). But
as Laird and Olivier note, Poisson log-linear
models are closely related to the familiar
proportional-hazards model.
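
A short numerical sketch shows how the additive log-rate model is read. The coefficient values and the exposure below are hypothetical, and raw level scores s, t, u = 1, 2, 3 are used as in the displayed equation rather than the rescaled scores that an orthogonal polynomial coding would produce.

```python
import math

def expected_count(exposure, beta, s, t, u):
    """m_stu = w_stu * exp(beta0 + beta1*s + beta2*t + beta3*u)."""
    beta0, beta1, beta2, beta3 = beta
    return exposure * math.exp(beta0 + beta1 * s + beta2 * t + beta3 * u)

def rate_ratio(slope):
    """Multiplicative change in the rate per one-level increase of a factor."""
    return math.exp(slope)

# Hypothetical coefficients: intercept and linear effects of A, B and C.
beta = (-6.0, 0.3, 0.1, -0.2)
m_low = expected_count(1200.0, beta, 1, 2, 3)
m_high = expected_count(1200.0, beta, 2, 2, 3)
```

With equal exposure in the two cells, `m_high / m_low` equals `rate_ratio(0.3)`: moving A up one level multiplies the event rate by the same factor whatever the levels of B and C, which is what "additive on the log-rate scale" means.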

Covariates. An attractive feature of LOGLINEAR
is the covariate option. If X is a quantitative
covariate or dummy variable, it may be added to
the model by using a WITH specification. For
example, suppose we wish to examine the linear
effect of X on the log-odds that A = 1 instead
of A = 2. A modification of the model given in
(3.4) might be as follows:

LOGLINEAR A(1,2) BY B(1,3) C(1,3) WITH X/
DESIGN = A, A BY B, A BY C, A BY X/  (3. )

4. Some Simple Diagnostic Tests


Maximum likelihood or other estimation methods
derived from frequentist theory can be difficult
to apply to sparse data. Table 1 gives three
simple examples of sparse data in 2x2x2 contingency
tables. These data can be studied either
in terms of logit models (C the response and A
and B the factors) or in terms of the equivalent
log-linear models. MLE's do not exist for the
additive logit model (model of no 3-factor
interaction) applied to Table 1a. MLE's do not exist
for the saturated logit (or log-linear) model
applied to Table 1c. For Table 1b the theory is
less clearcut; the zero counts for responses on
C when A=B=1 amount to giving zero weight to
that response pattern in a logit model. Because
of this the main effects of A and B on the logits
of C are not simultaneously estimable. We treat
all three cases with the corresponding models
discussed above as nonexistence problems, however,
recognizing that nonexistence might not be
the preferred term for Table 1b.

Clogg, Rubin, and Weidman (1985) use these three
contingency tables to compare eight popular logit
regression or log-linear analysis programs. The
LOGLINEAR procedure in SPSS-X was one of the
programs considered. The following discussion
indicates that there are some problems with
LOGLINEAR, at least in the area of providing
diagnostic information.

For Table 1a and using the additive logit model
(model of no 3-factor interaction), LOGLINEAR
prints chi-squared values of 0.00, 2 degrees of
freedom, and two zero fitted frequencies
corresponding to the sampling zeroes. From Haberman
(1974a) these are the correct answers. This
model would have 1 df if no more than one
sampling zero occurs (or if all counts are positive),
and most researchers would like to know why the
correct answer is df = 2. Neither the program
output nor the documentation provides any help on
this matter. The two main effects are not
simultaneously identifiable: the LOGLINEAR fixup
deletes the B-C interaction term (for B's effect
on C), but of course the A-C interaction term
could have been deleted with equal justification.
It is only because the B-C interaction
information was stored in the "last" entry in the
relevant arrays or matrices that this parameter
value was deleted. (Incidentally, LOGLINEAR
prints for both parameter values and standard
errors for deleted parameter values.) The
only diagnostic message given by the program is
"ML did not converge," but this diagnostic is
misleading. The program did give the correct,
and exact, ML solution for the expected
frequencies, which in this case are merely the
observed frequencies. Researchers might conclude
that the A-C interaction was estimated
appropriately and that the B-C interaction is
zero, but of course such an inference would be
incorrect. The estimated value of the A-C
interaction does not refer to the contrast of
log-frequencies that is used to define the original
model. The point is that the user is left in
the dark concerning what the program did, what
the results mean, and what could be done to
remedy the problem.

For Table 1c using the saturated model, the
output is again somewhat misleading. The MLE's do
not exist for the saturated model when there are
sampling zeroes, so some indication of this
would be expected. Here is what LOGLINEAR gives.
The program gives the correct chi-squared value
(0.00) and the correct df (df = 0). But even
though the MLE's of the parameters do not exist,
LOGLINEAR prints out estimates for them along
with standard errors. The standard errors are
large and the parameter values are nonsensical,
so some researchers would recognize that there
is a kind of identifiability problem. But no
warning messages or diagnostics are printed.

The additive logit model (model of no 3-factor
interaction) was applied to Table 1b. LOGLINEAR
gives chi-squared values of 0.00, which is
correct. But most ML advocates would say that
the model applied to Table 1b is equivalent to
blanking out the two sampling zeroes because the
ML solution will estimate these frequencies as
zeroes. The model would be redefined and
reparameterized for the remaining six cells. When
this is done, the additive logit model is
saturated relative to these six cells, so df = 0
should be reported. Nevertheless, LOGLINEAR
gives df = 1. It is curious that a chi-squared
value that has to be zero for such a sparse table
would be said to have one degree of freedom. And
once the two sampling zeroes are removed, the
parameter values that would be calculated no
longer refer to standard contrasts of the logits.
LOGLINEAR nonetheless prints parameter values
and standard errors with no warning that they do
not refer to the contrasts originally specified
in formulating the model.

To summarize, LOGLINEAR does not do a good job
in reporting results obtained from elementary
examples with nonexistent MLE's. Diagnostics
are virtually nonexistent. Users who suspect
problems in their output (suspicious parameter
values and/or standard errors, or unanticipated
degrees of freedom) will have to turn to an
experienced consultant to answer their questions.

To put this evaluation in proper perspective,
however, it should be noted that LOGLINEAR
performed at least as well as the seven other
programs examined in Clogg, Rubin, and Weidman
(1985). More internal checks for consistency
and more intelligible diagnostic messages are
required in all of these programs.

5.  Suggestions  for  Improvement 

Another procedure in SPSS-X can be used for
analysis of categorical data too: HILOGLINEAR,
a program based on the IPF
(iterative-proportional-fitting) algorithm. The "HI" is not a
salutation, but stands for hierarchical models;
this procedure can calculate ML fits for hierarchical
models having observed marginals as sufficient
statistics. HILOGLINEAR was evidently prepared
to serve as an exploratory screening procedure
that could be used to select models for further
study in LOGLINEAR. At present, however,
HILOGLINEAR appears to be quite preliminary and we
cannot recommend it. The procedure does not
calculate parameter estimates for unsaturated
models; because of this, the procedure can never
stand alone even if the researcher is interested
in the kinds of models that can be considered
with the procedure. The program does not
calculate degrees of freedom correctly for incomplete
tables: the example in the SPSS-X documentation
(one of the classic examples; see Goodman (1968)
and Clogg (1985)) reports incorrect df because
it does not recognize separable subtables. There
are both forward selection and backward
elimination model search options. A general
recommendation is that HILOGLINEAR should be greatly
improved and expanded; the P4F program in BMD
provides a good example of what should be
incorporated.

We have the following recommendations for
improving LOGLINEAR, most of which can be
implemented easily:

1. Improve the documentation. How covariates
may or may not be used is unclear from the
published report. There are no examples
with continuous covariates. There are few
references to the literature. There is
little indication that the CWEIGHT command
can be used to adjust Poisson counts for
exposures, and no indication that the program
provides a flexible procedure for analysis
of rates.

2. Output: multinomial-response models are
alternatives to discriminant analysis. Since
multinomial-response models (logit-type
models) are convincing alternatives to linear
discriminant analysis (Press and Wilson,
1978), it would be helpful if output from
such models could be arranged to facilitate
practical discriminant analysis. This would
involve obtaining the predicted proportions
in the I response levels for each of the J
groups and assessing their variability
(prediction intervals) under the model. This is
easy to do. Output from programs dealing
exclusively with dichotomous logistic regression
models (SAS: LOGIST or PREDICT, BMD:
PLR) already facilitates such analysis.

3. Input-Output: linear contrasts of parameters
and the associated variance-covariance matrix.
If $\hat\beta$ is the vector of parameter estimates,
linear contrasts of the form $L\hat\beta$ can be used
to advantage. Such linear contrasts can be
tested using Wald statistics. Since the
variance of $\hat\beta$ is already calculated, this
creates no special problem. Various
specifications of L could be used to examine how
a given model might be simplified (exploratory
use), to examine collapsibility of
categories (Suman, 1985), and to perform
simultaneous tests on sets of parameters
without resorting to the comparison of
nested models and likelihood-ratio tests.

4. Output: variances of measures of association.
Haberman (1982) derived the approximate
distributions for both entropy and
concentration measures of association. This
information should be added to LOGLINEAR.

5. Input: adding fractional counts to the data.
It is easy to add the same constant to all
cell counts (e.g., 1/2), and there is some
justification for doing so when saturated
models are considered (Goodman, 1970).
Adding constants to the frequencies can be
interpreted from a Bayesian perspective; the
prior is either beta or Dirichlet. Adding
the same constant to all counts shrinks the
data toward equiprobability. In a logit
model this shrinks all parameter values,
including the constant, toward zero. More
flexible priors that are model based are
discussed in Clogg, Rubin, and Weidman
(1985). Simple changes in LOGLINEAR would
allow implementation of these. (The most
obvious choice in a logit model is to add
constants to "successes" and "failures" in
proportion to the marginal distribution of
the response.)

6. Programming: internal checks. As the
examples in the previous section show, there
are problems when even simple tables with
sparse data are analyzed. The program does
not seem to "correct" for zero observed
group totals in multinomial-response models,
or at least does not do so all of the time.
For tables of high dimension there should
be additional checks on a cycle-to-cycle
basis for estimability. We believe, but
cannot prove, that it is not sufficient in
general to let conclusions reached in one
cycle about estimability dictate model
redefinition (parameter deletion) in all
subsequent cycles.

7. Diagnostics: warning messages, cautionary
remarks. The only warning we have seen in
using LOGLINEAR is "ML did not converge."
This is not informative enough. There are
many other messages that should be given,
particularly when sparse data problems arise.
Some information about possible rank problems
in the information matrix would be
helpful as well. (Perhaps such diagnostics
could be borrowed from those in wide use for
the X'X matrix in regression.) These problems
are ignored in the technical documentation.
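
The Wald test of a linear contrast described in recommendation 3 can be sketched as follows. This is only an illustration under stated assumptions: the estimates, the covariance matrix, and the function name `wald_statistic` are hypothetical and are not taken from LOGLINEAR or any other program discussed here.

```python
import numpy as np

def wald_statistic(beta, cov, L):
    """Wald chi-square statistic and degrees of freedom for H0: L beta = 0."""
    beta = np.asarray(beta, dtype=float)
    cov = np.asarray(cov, dtype=float)
    L = np.atleast_2d(np.asarray(L, dtype=float))
    contrast = L @ beta                      # estimated contrast L beta
    contrast_cov = L @ cov @ L.T             # its variance-covariance matrix
    stat = float(contrast @ np.linalg.solve(contrast_cov, contrast))
    return stat, int(np.linalg.matrix_rank(L))

# Hypothetical parameter estimates and covariance matrix (illustration only).
beta_hat = np.array([0.50, 0.45, -0.20])
cov_hat = np.diag([0.04, 0.05, 0.03])

# Contrast testing whether the first two parameters are equal.
L_equal = [[1.0, -1.0, 0.0]]
stat, df = wald_statistic(beta_hat, cov_hat, L_equal)
```

The statistic is referred to a chi-square distribution with `df` degrees of freedom, so no nested model needs to be refitted.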

In spite of the criticisms noted above,
LOGLINEAR is a good program for the analysis of
contingency tables. In our opinion, researchers
who have access to both LOGLINEAR and BMD's
program P4F will be able to deal with most
contingency table problems that are likely to arise in
practice.
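
The effect of the fractional counts discussed in recommendation 5 can be seen in a small numerical sketch. The counts and the function name `logit_with_constant` are hypothetical; the constant 0.5 is used only as an example value.

```python
import math

def logit_with_constant(successes, failures, c=0.0):
    """Observed logit after adding the same constant c to both counts."""
    return math.log((successes + c) / (failures + c))

# Hypothetical counts: 18 successes and 2 failures.
raw = logit_with_constant(18, 2)            # logit from the raw counts
shrunk = logit_with_constant(18, 2, c=0.5)  # logit after adding 0.5 to each cell

# With a sampling zero the raw logit is undefined, but the adjusted one is finite.
zero_cell = logit_with_constant(5, 0, c=0.5)
```

Adding the same constant to both cells pulls the logit toward zero, which is the shrinkage toward equiprobability described above.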


Table 1. Three 2x2x2 Contingency Tables with Sampling Zeroes

           1a.          1b.          1c.
            C            C            C
          1    2       1    2       1    2

   11     0    3       0    3
   21     9    4       9    4       9    4
   12     6    3       6    3       6    3
   22     5    0       5    3       4    1

n = 30

Parametric models for the multinomial distribution are considered within the larger
family of regular exponential family models. This allows a unified approach to fitting
multinomial regression models using algorithms based on iteratively reweighted least
squares. A useful example is provided by the family of continuation ratio models.
Generalized linear models are considered as an important special case of the
exponential family which provides an approach to categorical data based on log-linear
models.


1. INTRODUCTION


We consider the theory and practice of maximum
likelihood estimation for multinomial regression
models. These are parametric models for data
obtained by measuring a categorical response in
the presence of possibly multiple explanatory
variables. The appropriate sampling scheme is
what is known as product multinomial sampling.

We discuss the exponential family formulation of
such models and review the fitting of general
exponential family models using iteratively
reweighted least squares. The discussion is
based on the approach of Jennrich and Moore
(1975), which deserves to be more widely
recognized. We include a slightly simpler
derivation of their results, which basically
consist of a formal identification of the
maximum likelihood problem with a weighted
nonlinear least squares problem. This yields an
iteratively reweighted Gauss-Newton algorithm
for the computation of maximum likelihood
estimates and asymptotic standard errors. We
illustrate the theory with an example using
continuation ratio models for ordinal data
(Fienberg, 1980). We also discuss the
generalized linear models of Nelder and
Wedderburn (1972) and the resulting fitting
algorithm, which is also based on iteratively
reweighted least squares. This subclass has the
advantage of being much more analogous to
ordinary linear models and is the basis for the
GLIM statistical computing system. Finally, we
consider the analysis of categorical data using
GLIM, which rests on the assumption of Poisson
sampling.


2. MULTINOMIAL REGRESSION MODELS

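We consider a random n-dimensional vector Y
having a multinomial distribution as a member of
a regular exponential family. To this end we
write the density of Y as

$$p(y, \eta) = \Pr\{Y_1 = y_1, \ldots, Y_n = y_n\} = \exp\Big\{ \sum_i \eta_i y_i - y_\cdot \ln\Big(\sum_j e^{\eta_j}\Big) + \ln\binom{y_\cdot}{y} \Big\},$$

where $y_\cdot = \sum_i y_i$, $\binom{y_\cdot}{y}$ is the multinomial
coefficient, and the multinomial probabilities
can be computed from the natural parameters as

$$\pi_i = e^{\eta_i} \Big/ \sum_j e^{\eta_j}.$$

We assume that the n-dimensional natural
parameter $\eta$ depends on $p \le n$ parameters $\theta$ and
write $E_\theta(Y) = \mu(\theta)$ and $\mathrm{Var}_\theta(Y) = \Sigma(\theta)$. For the
multinomial distribution these are $\mu(\theta) = y_\cdot\,\pi(\theta)$
and $\Sigma(\theta) = y_\cdot\,[D(\pi) - \pi\pi']$, where $D(\pi)$ is the
diagonal matrix with diagonal $\pi$. Writing $d(\eta)$ for the
term of the exponent that depends on $\eta$ alone (here
$d(\eta) = -y_\cdot \ln \sum_j e^{\eta_j}$) and differentiating under
the integral sign, we have the standard results

$$\mu = -\partial d/\partial\eta \quad\text{and}\quad \Sigma = -\partial^2 d/\partial\eta\,\partial\eta'.$$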


The likelihood equations for the regular
exponential family likelihood are

$$s(\theta) = \frac{\partial\eta'}{\partial\theta}\,(y - \mu) = 0.$$


To transform these equations we first apply the
chain rule to the previous expressions for the
mean and variance functions to obtain

$$\frac{\partial\mu'}{\partial\theta} = \frac{\partial\eta'}{\partial\theta}\,\Sigma, \qquad \frac{\partial\mu}{\partial\theta'} = \frac{\partial}{\partial\theta'}\Big(-\frac{\partial d}{\partial\eta}\Big) = \Sigma\,\frac{\partial\eta}{\partial\theta'}.$$

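Now let $\Sigma^-$ be a symmetric generalized inverse
of $\Sigma$, satisfying $\Sigma\Sigma^-\Sigma = \Sigma$. In the multinomial
case we have $\Sigma^- = (y_\cdot)^{-1} D^{-1}(\pi)$. Then since
$a'\Sigma = 0$ implies $\mathrm{Var}(a'(Y-\mu)) = 0$, we have
$\Sigma\Sigma^-(Y-\mu) = Y-\mu$ ($Y-\mu$ is in the range of $\Sigma$) with probability
one. Combining these results we may write the
likelihood equations as

$$0 = s(\theta) = \frac{\partial\eta'}{\partial\theta}\,(y-\mu) = \frac{\partial\eta'}{\partial\theta}\,\Sigma\Sigma^-(y-\mu) = \frac{\partial\mu'}{\partial\theta}\,\Sigma^-(y-\mu).$$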
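
The generalized inverse used for the multinomial case can be checked numerically. The probabilities and counts below are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical multinomial probabilities and total count.
pi = np.array([0.1, 0.2, 0.3, 0.4])
total = 50.0

Sigma = total * (np.diag(pi) - np.outer(pi, pi))   # Var(Y) = y.[D(pi) - pi pi']
Sigma_ginv = np.diag(1.0 / pi) / total             # (y.)^{-1} D^{-1}(pi)

# Defining property of a generalized inverse: Sigma Sigma^- Sigma = Sigma.
ginv_ok = np.allclose(Sigma @ Sigma_ginv @ Sigma, Sigma)

# Any residual y - mu sums to zero, so it lies in the range of Sigma.
y = np.array([3.0, 12.0, 14.0, 21.0])              # hypothetical counts summing to 50
resid = y - total * pi
range_ok = np.allclose(Sigma @ Sigma_ginv @ resid, resid)
```

Both checks hold because $\Sigma\Sigma^- = I - \pi 1'$, which leaves unchanged any vector whose components sum to zero.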

These are the normal equations for the nonlinear
least squares problem: minimize
$(y-\mu)'\Sigma^-(y-\mu)$. It also follows from these
results that

$$\Sigma\Sigma^-\,\frac{\partial\mu}{\partial\theta'} = \frac{\partial\mu}{\partial\theta'},$$

allowing us to write the information matrix as

$$I(\theta) = \mathrm{Var}\,s(\theta) = \frac{\partial\eta'}{\partial\theta}\,\Sigma\,\frac{\partial\eta}{\partial\theta'} = \frac{\partial\mu'}{\partial\theta}\,\Sigma^-\,\frac{\partial\mu}{\partial\theta'}.$$

Therefore the Fisher scoring algorithm becomes

$$\Delta(\theta) = I^{-1}(\theta)\,s(\theta) = \Big(\frac{\partial\mu'}{\partial\theta}\,\Sigma^-\,\frac{\partial\mu}{\partial\theta'}\Big)^{-1} \frac{\partial\mu'}{\partial\theta}\,\Sigma^-(y-\mu),$$

which is an iteratively reweighted Gauss-Newton
algorithm for the nonlinear least squares
problem. Asymptotic standard errors, obtained
by inverting the information matrix, may be
computed from the usual standard errors given by
the Gauss-Newton algorithm if we omit the
residual mean square

$$\hat\sigma^2 = (y-\mu)'\Sigma^-(y-\mu)/(n-p).$$

For the multinomial distribution the numerator
is just the Pearson chi-square statistic for
goodness-of-fit. Recent work on quasi-likelihood
models (McCullagh and Nelder, 1983)
suggests that if $\hat\sigma^2$ is not reasonably close to
one, e.g., is significantly larger than one,
then the asymptotic standard errors should be
corrected by multiplying by $\hat\sigma$.


In practice one can fit general exponential
family models using any weighted regression
program which can be iterated after
recomputation of the weights. This can be done
for example in MINITAB. Nonlinear regression
programs which implement the Gauss-Newton
algorithm are easier to use provided they allow
iterative computation of the weights. Such
programs are available in BMDP, SAS and GENSTAT.
To use such a program one must specify the
quantities $\mu$, $\partial\mu/\partial\theta$ and $\Sigma^-$ (means, derivatives
and weights). We used the program BMDP3R
(Dixon et al., 1981), which also allows the use
of a loss function as a termination criterion.
The natural loss function is the deviance,

$$G^2 = -2\sum_i \big[\, y_i\log(\hat\pi_i) - y_i\log(y_i/y_\cdot) \,\big],$$

where $\hat\pi_i$ are the estimated probabilities. This
is the likelihood ratio statistic, with n-1-p
d.f., of the current to the saturated model,
which estimates the multinomial probabilities by
the observed proportions.
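
The iteratively reweighted Gauss-Newton iteration can be sketched in a few lines. This is a minimal illustration, not the BMDP3R setup used in the paper: it assumes a single multinomial whose natural parameters are linear in θ (η = Xθ, with no constant column in X), and the data, the score matrix `X_scores`, and the function names are hypothetical.

```python
import numpy as np

def multinomial_moments(theta, X, total):
    """Probabilities, means, and d(mu)/d(theta') for eta = X theta."""
    eta = X @ theta
    w = np.exp(eta - eta.max())          # subtract the max for numerical stability
    pi = w / w.sum()
    mu = total * pi
    jac = total * (np.diag(pi) - np.outer(pi, pi)) @ X
    return pi, mu, jac

def fit_gauss_newton(y, X, max_iter=50, tol=1e-10):
    """Fisher scoring written as reweighted Gauss-Newton steps."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    total = y.sum()
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi, mu, jac = multinomial_moments(theta, X, total)
        weights = 1.0 / mu               # diagonal of the generalized inverse of Sigma
        info = jac.T @ (weights[:, None] * jac)
        step = np.linalg.solve(info, jac.T @ (weights * (y - mu)))
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    pi, mu, jac = multinomial_moments(theta, X, total)
    info = jac.T @ ((1.0 / mu)[:, None] * jac)
    std_err = np.sqrt(np.diag(np.linalg.inv(info)))   # residual mean square omitted
    pearson = float(np.sum((y - mu) ** 2 / mu))
    pos = y > 0
    deviance = float(2.0 * np.sum(y[pos] * np.log(y[pos] / mu[pos])))
    return theta, std_err, mu, pearson, deviance

# Hypothetical counts in four ordered categories, with integer scores as the design.
y_obs = np.array([10.0, 20.0, 30.0, 40.0])
X_scores = np.array([[0.0], [1.0], [2.0], [3.0]])
theta_hat, std_err, mu_hat, pearson, deviance = fit_gauss_newton(y_obs, X_scores)
```

The Pearson statistic returned here is the numerator of the residual mean square, and the deviance is computed against the saturated model, as in the text.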


3. AN EXAMPLE: CONTINUATION RATIO MODELS


If the ordering of the categories 1,...,n is not
arbitrary (or even if it is), the n-1
conditional probabilities known as continuation
ratios (Fienberg, 1980) are defined as

$$\rho_i = \sum_{j>i}\pi_j \Big/ \sum_{j\ge i}\pi_j, \qquad i = 1, \ldots, n-1.$$

Continuation ratio models are just logit models
for these conditional probabilities. In the
framework of product multinomial sampling we
have multinomial probabilities
$\pi_{ij}$ ($1 \le i \le R$, $1 \le j \le C$), satisfying

$$\sum_j \pi_{ij} = 1$$

for $1 \le i \le R$, and continuation ratios $\rho_{ij}$. The
model is specified by writing the logits,
$\log\{\rho_{ij}/(1-\rho_{ij})\}$,
as functions of the explanatory variables and
parameters $\theta$. For example Fienberg (1980)
considers data on 3 levels of educational
attainment. The explanatory variables are age
(2 levels), race (2 levels) and father's
education (4 levels). The data consist of
counts of the three levels of the response
variable for each combination of the three
explanatory variables, for a total of 16
trinomials (32 d.f.). Fienberg (1980)
considers, among others, an 18 parameter model
having different parameters for each of the two
continuation ratios. The model includes main
effects for each of the three factors, as well
as an interaction between father's education and
race. In an obvious notation the model is given
by

$$\mathrm{logit}(\rho_{arfc}) = \mu^{c} + \alpha_a^{c} + \beta_r^{c} + \gamma_f^{c} + (\beta\gamma)_{rf}^{c},$$

where c (=1,2) denotes continuation ratio and the
appropriate constraints (e.g., $\alpha_1^{c} = 0$) are
imposed for identifiability.


As Fienberg (1980) points out, this model can be
fitted as a separate pair of logit models for
the two conditional probabilities. We fitted the
entire model using the nonlinear regression
program BMDP3R. The means (probabilities),
derivatives, weights and loss function are
supplied to the program in a FORTRAN subroutine
(Figure 1). The multinomial probabilities $\pi_i$ may
be computed for $i \ge 2$ ($\pi_1 = 1-\rho_1$) from the relation

$$\pi_i = (1-\rho_i)\prod_{j=1}^{i-1}\rho_j \qquad (\rho_n = 0),$$

or more easily from the recursion

$$\pi_i = (1-\rho_i)\,\frac{\rho_{i-1}}{1-\rho_{i-1}}\,\pi_{i-1},$$

which can also be used for the computation of
derivatives. The program was run with initial
values of 0.1 for all parameters and converged
in 8 iterations to a $G^2$ of 18.6, which agrees
with the value of 18.5 given by Fienberg (1980)
in Table 6-11. This example is also discussed
in Cox (1985).
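
The relation between continuation ratios and multinomial probabilities can be sketched directly. The function names and the numerical ratios are hypothetical; the code assumes the last ratio is zero and that all earlier ratios are strictly less than one, so the recursion never divides by zero.

```python
import numpy as np

def probs_from_continuation_ratios(rho):
    """pi_i = (1 - rho_i) * prod_{j<i} rho_j, with rho_n = 0."""
    rho = np.asarray(rho, dtype=float)
    reach = np.concatenate(([1.0], np.cumprod(rho[:-1])))  # P(category >= i)
    return (1.0 - rho) * reach

def probs_by_recursion(rho):
    """Same probabilities from the one-step recursion."""
    rho = np.asarray(rho, dtype=float)
    pi = np.empty_like(rho)
    pi[0] = 1.0 - rho[0]
    for i in range(1, len(rho)):
        pi[i] = (1.0 - rho[i]) * rho[i - 1] * pi[i - 1] / (1.0 - rho[i - 1])
    return pi

def continuation_ratios(pi):
    """rho_i = sum_{j>i} pi_j / sum_{j>=i} pi_j."""
    pi = np.asarray(pi, dtype=float)
    tail = np.cumsum(pi[::-1])[::-1]     # tail[i] = sum of pi_j for j >= i
    return (tail - pi) / tail

# Hypothetical continuation ratios for four ordered categories.
rho = np.array([0.7, 0.5, 0.2, 0.0])
pi_product = probs_from_continuation_ratios(rho)
pi_recursive = probs_by_recursion(rho)
```

Both routes give the same probabilities, and applying `continuation_ratios` to the result recovers the ratios that were supplied.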


There is of course no reason why one should have
different parameters for different continuation
ratios, nor need the model for the logits be
linear. Consider, for example, the following
multiplicative interaction model for an RxC
table under product multinomial sampling,

$$\mathrm{logit}(\rho_{ij}) = \beta_j + \gamma_i + \theta_i\delta_j, \qquad i = 1, \ldots, R; \quad j = 1, \ldots, C-1,$$

subject to the identifiability constraints
$\beta_1 = 0$, $\delta_1 = 0$, for a total of 2R+2C-4 parameters.
This model cannot be fitted as a series of logit
models for the continuation ratios. As an
illustration we consider a 4x4 table discussed
by Cox and Chuang (1984). The data consist of
ratings (poor, fair, good, very good to
excellent) of four analgesic drugs. The data
are given in Figure 2. The $\gamma$ and $\theta$ parameters
now model differences between drugs, while the $\beta$
and $\delta$ parameters model differences between
continuation ratios. Figures 1-2 display the
FORTRAN and BMDP programs for fitting a model
with eight constraints (6 d.f.), which
essentially identify the first two and the last
two drugs. Here again convergence was fairly
rapid (Figure 3) with initial values taken from
a previous, unconstrained fit. The deviance,
$G^2$ = 9.58 with 6 d.f., as well as parameter
estimates and asymptotic standard errors are
given in Figure 3. Observed and predicted
proportions (Figure 4) can be extracted for
the computation of standardized residuals.



4.  GENERALIZED  LINEAR  MODELS  -  A  SPECIAL  CASE 


This class of models is important because of its
useful similarities to ordinary linear models
and because it forms the basis of the GLIM
statistical system (Baker and Nelder, 1978).

Two additional assumptions are required for a
generalized linear model. The first is that the
components of the random vector Y are
independent. This means that we can factor the
likelihood so that the function d is a sum,

$$d(\eta) = \sum_i d_i(\eta_i), \qquad \mu_i = -\partial d_i/\partial\eta_i, \qquad \mathrm{Var}(Y_i) = -\partial^2 d_i/\partial\eta_i^2 = \partial\mu_i/\partial\eta_i,$$

and $\Sigma$ is a nonsingular, diagonal matrix. The
second assumption is that $\eta_i = f(\phi_i)$, where f is a
monotone link function and $\phi$ is the linear
predictor, $\phi = X\theta$, where X is a matrix of
predictor variables. Thus on the appropriate
scale we are dealing with a linear regression
problem although not with the usual error
structure.


Nelder and Wedderburn (1972) develop an
iteratively reweighted least squares algorithm
for fitting generalized linear models by
defining a working dependent variable

$$z_i = \phi_i + (y_i - \mu_i)\,\frac{\partial\phi_i}{\partial\mu_i}$$

and a diagonal matrix of weights

$$W = \mathrm{diag}\Big\{ \Big(\frac{\partial\mu_i}{\partial\phi_i}\Big)^2 \Big/ \mathrm{Var}(Y_i) \Big\}.$$

With these definitions the likelihood equations
can be written as $X'Wz = X'WX\theta$, which are the
normal equations for the linear least squares
problem: minimize $(z-X\theta)'W(z-X\theta)$.

The Fisher scoring algorithm can be rewritten as

$$\Delta\theta = (X'WX)^{-1}X'W(z-\phi),$$

or, since $\phi = X\theta$, as

$$\theta + \Delta\theta = (X'WX)^{-1}X'Wz.$$

Thus each iteration yields the next
approximation, rather than the increment. Again
omitting the residual mean square, the variances
of the least squares estimates are also correct
since $X'WX = (\partial\mu'/\partial\theta)\,\Sigma^{-1}\,(\partial\mu/\partial\theta')$.


Because of the assumption of statistical
independence the natural error structure in GLIM
for categorical data is the Poisson
distribution. The connection between Poisson
and multinomial models is well known (McCullagh
and Nelder, 1983). An approach using the
multinomial distribution is possible by using
composite link functions (Thompson and Baker,
1981) although this is much more involved and,
we believe, more awkward than the method of
fitting expected values discussed previously.
Examples of log-linear models with Poisson
errors may be found in Nelder and Wedderburn
(1972) and McCullagh and Nelder (1983).

Figure 1. FORTRAN program for the computation of means (probabilities),
derivatives and weights (variances) for a 14 parameter
continuation ratio model. The program is used with BMDP3R.
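
The iteratively reweighted least squares scheme of Section 4 can be sketched for the Poisson log-linear case. This is an illustration only and is not GLIM code: it assumes a log link, so the working variable is z = φ + (y - μ)/μ and the weights are μ. The 2x2 table, the design matrix, and the function name are hypothetical.

```python
import numpy as np

def fit_poisson_loglinear(y, X, max_iter=25, tol=1e-10):
    """IRLS for a Poisson log-linear model with linear predictor phi = X theta."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    mu = y + 0.5                          # starting values; 0.5 guards against log(0)
    phi = np.log(mu)
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        z = phi + (y - mu) / mu           # working dependent variable
        w = mu                            # weights (d mu / d phi)^2 / Var(Y)
        new_theta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        phi = X @ theta
        mu = np.exp(phi)
        if converged:
            break
    cov = np.linalg.inv(X.T @ (mu[:, None] * X))   # residual mean square omitted
    return theta, mu, np.sqrt(np.diag(cov))

# Hypothetical 2x2 table, cells ordered (row 1, col 1), (1, 2), (2, 1), (2, 2).
counts = np.array([10.0, 20.0, 30.0, 40.0])
# Independence model: constant, row effect, column effect.
design = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0],
                   [1.0, 1.0, 1.0]])
theta_hat, fitted, std_err = fit_poisson_loglinear(counts, design)
```

Each pass solves the weighted normal equations for the next approximation to θ directly. For the independence model the fitted values equal the products of the row and column totals divided by the grand total.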
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GRAPHICAL  ANALYSIS  OF  PROPORTIONAL  POISSON  RATES 


Brian S. Yandell


University of Wisconsin - Madison


We present graphical tools for examining proportionality of a Poisson process rate to a
baseline from a group of similar processes. We examine smooth deviations from this baseline
using smoothing splines for general linear models. An example of egg-laying rates for
leafhoppers is examined in some detail.


1. Introduction

This paper concerns inference for nonstationary Poisson
rates which are "almost" proportional to a common baseline. It
provides a means for "pre-smoothing" rate estimates to avoid
some of the common problems of estimating functions with large
curvature at certain places.

One may believe that a group of female potato leafhoppers
in the same fluctuating temperature regime (Hogg, 1984) would
oviposit at rates which rose and fell at roughly the same time.
That is, one would suppose that the oviposition rates would be
proportional to a common baseline rate. One could estimate this
baseline rate, and then estimate the individual curves by simply
determining the constant of proportionality, as was done by
Bartoszynski et al. (1981). However, one might want to examine
the proportionality as a function of time to determine whether or
not it is constant.

We propose a method to estimate this proportionality over
time. Although many approaches are possible (Clevenson and
Zidek, 1977; Hastie and Tibshirani, 1984), we develop our
estimators in the framework of penalized maximum likelihood (Good
and Gaskins, 1971; O'Sullivan, Yandell, and Raynor, Jr.,
1984).


2.1. Log Likelihood

The likelihood can be written down and decomposed into
pieces so that, subject to constraints, we can have a separate
likelihood for each individual proportionality term. The overall
log likelihood

-  IS*, .I'o?' 

^  ,  ,  , 

can  be  reexpressed  as  the  sum  of 

L(B")=  >.^IIog()\/r)-e"((,))  (2  -ij 


—  VV  ),,Il0g(r)  .  1-0  (,  )| 


Throughout this paper, a dot subscript indicates a sum over the indicated
index. Note that (2.3) is a Poisson penalized likelihood, and
(2.4) is a multinomial penalized likelihood conditional on $y_{\cdot j}$.
In other words, $y_{ij}$ is binomial ($y_{\cdot j}$, $\rho_i(t_j)/r$). This suggests
splitting (2.4) into terms of the form


and proportionality terms are developed in Section 3. Section 4
briefly presents diagnostic tools. The methods are applied to
leafhopper oviposition data in Section 5.

2. Proportional Poisson Rates

An individual leafhopper i, i = 1, ..., r, may lay eggs
at times $t_1 < \cdots < t_n$. The count is assumed Poisson with
mean $h_i(t_j)$, which may be nonstationary. We focus on the model

$$h_i(t) = h^0(t)\,\rho_i(t), \qquad i = 1, \ldots, r.$$

Proportional rates would correspond to constant $\rho_i$, with $h^0(\cdot)$
being the baseline rate. Taking logarithms yields

$$\log(h_i(t)) = \log(h^0(t)) + \log(\rho_i(t)), \qquad (2.1)$$

or, reparameterizing, one has

$$\theta_i(t) = \theta^0(t) + \alpha_i(t), \qquad i = 1, \ldots, r. \qquad (2.2)$$

The degree to which the $\alpha_i$, or $\rho_i$, are not constant corresponds
to how much the proportional rates assumption is violated. This
suggests that one could evaluate the degree of nonproportionality
by estimating $\alpha_i$, or equivalently $\rho_i$, and plotting these against
time.


Thus the log likelihood can be split into r+1 terms, for $\theta^0$
and for $\alpha_i$, i = 1, ..., r, with the restriction that
$\sum_i \rho_i(t) = r$.


3. Penalized Maximum Likelihood Estimates

We now impose a penalty on the estimators to insure a certain
smoothness not guaranteed by the likelihood as written. The
penalized maximum likelihood estimate (MPLE) for the baseline
rate (Bartoszynski et al., 1981; O'Sullivan, Yandell, and
Raynor, Jr., 1984) can be found by minimizing, for fixed $\lambda$,

$$L(\theta^0, \lambda) = L(\theta^0) + \lambda J(\theta^0) \qquad (3.1)$$

in which $J(\cdot)$ is an appropriate penalty function, typically

$$J(f) = \int \big(f^{(m)}(t)\big)^2\,dt \qquad (3.2)$$

with m = 1 or 2 for penalty on the slope or curvature,
respectively. A large value of the penalty, or smoothing, parameter $\lambda$
forces $\theta^0$ to be nearly linear, while a small $\lambda$ allows $\theta^0$ to
interpolate the data.
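
A discrete version of the penalty (3.2) helps show what the two choices of m measure. The sketch below approximates the integral by finite differences on an equally spaced grid; the grid, the test curves, and the function name `roughness_penalty` are hypothetical and are not part of the spline software used for the analysis.

```python
import numpy as np

def roughness_penalty(values, step, m):
    """Finite-difference approximation to the integral of (f^(m)(t))^2."""
    derivative = np.diff(values, n=m) / step ** m   # m-th divided differences
    return float(np.sum(derivative ** 2) * step)

t = np.linspace(0.0, 1.0, 101)
step = t[1] - t[0]
line = 1.0 + 2.0 * t                  # straight line with slope 2
wiggle = np.sin(6.0 * np.pi * t)      # rapidly oscillating curve

slope_pen_line = roughness_penalty(line, step, 1)    # penalizes any nonzero slope
curv_pen_line = roughness_penalty(line, step, 2)     # a line has no curvature
curv_pen_wiggle = roughness_penalty(wiggle, step, 2)
```

With m = 2 the straight line incurs essentially no penalty, which is why a large smoothing parameter pushes the estimate toward linearity.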

The smoothing splines incorporate a prior belief that the
true curve is smooth in a certain sense. The smoothing parameters
$\lambda$ are chosen by means of generalized cross validation
(Craven and Wahba, 1979), which tries to minimize the mean
square error, forcing a tradeoff between bias and variance.

Section 2 formulates the problem of proportional rates.

Similar expressions can be written down for determining
the MPLE of $\alpha_i$ for each i,

$$L(\alpha_i, \lambda_i) = L(\alpha_i) + \lambda_i J(\alpha_i). \qquad (3.3)$$

When m = 1 and $\lambda_i = \infty$, the constant MPLEs are

ti,  =  log(r),  I  =  \,  .r. 

The estimation problem can be split into r+1 minimization
problems, for $\theta^0$ and for $\alpha_i$, i = 1, ..., r, provided we are willing to
ignore the restriction that $\sum_i \rho_i(t) = r$. Of course, such a
restriction could be imposed, but it would place awkward constraints on
the smoothing penalties.

3.1. Data over Time Intervals

I  he  daiii  iv'iividered  in  Section  is  grotiped  bv  2  or  3  dav 
miervab  \\  nh  ihiv  design  inihalance,  the  estimaies  of  ft'  and  of 
(1  niav  Ive  bia'-ed.  dei>eiufing  on  the  panern  of  grcniping.  How- 
e\ei,  ilie  mu < 'tid»iu'nal  evpeci.ihon  of  the  esnmaie'*  i'  unbia«-e(l 
pi  O' Hir'd  iIj.ii  ihr  p.inein  of  gioupm^’  !<•  iiidepeiKfeni  of  tfie  vnup 
of  an  invlnulu.il  We  lan  adiiivi  the  ('enali-'ed  lii,elih(»od  evprev 
sions  in  a  natural  w.iv  to  actouni  for  the  reduced  data,  nanielv. 

/ (h  I  *  X  ^ '  j f  ,  1- )1  < I  4) 

1  M  l  (  ^  1 

/■  m  ,  >  f  .  fog  'if  “  f 

”  M.  I  1  )  .  ,  -^r,,  ) 

in  which  ji,,  -  ,  o .  That  is,  fot  each  i.  there 

weie  I-  distinct  times  i,  at  which  counts  were  made  These 
counts  encompass  d,,  davs  each,  and  the  proportion  of  da\s  for  r 
out  of  the  total  count  ^ ^  is  ,  These  technical  adjust 

menis  were  used  for  computing,  but  are  not  pursued  further  in 
this  paper. 

3.2. Survival and Oviposition

Throughout the leafhopper study, individuals died. Thus
group size declined over time. These deaths can affect the
estimate of the "baseline" rate $h^0$, as well as the proportionality
terms $\rho_i$, even if all the rates are constant. This problem is most
profound for small groups, such as in the latter portion of the
leafhopper experiment.

A simple solution shown in the data analysis section is to
factor out a step function from the baseline rate, with steps at
times of death. This can be easily accomplished with partial
splines (Shiau, 1985; Wahba, 1983a). Appropriate modifications
can then be made to (2.4) based on the estimated step sizes.
A serious danger arises in overparameterizing the model with
steps for each individual.

4. Diagnostics for Poisson Rates

We propose an ad hoc "confidence interval" and log likelihood
residuals for graphical inspection of proportionality. At
present we have no concrete results, but support these tools by
analogy to other work.

Several diagnostics have been proposed for penalized maximum
likelihood in the linear (least squares) model with i.i.d.
errors. Wahba (1983b) proposed pointwise confidence intervals
based on a Bayesian model with normal errors. Carmody,
Eubank, and Thombs (1984) proposed jackknife confidence
intervals which performed poorly in comparison to the intervals
of Wahba (1983b). Other diagnostics based on residuals
(Eubank, 1984; Gunst and Eubank, 1983) naturally extend
diagnostics for unpenalized problems. Recent work of Cox (1984)
offers strong approximation of the penalized least squares
estimator in the i.i.d. case, under certain conditions on the design
points and smoothing parameter, which lead to simultaneous
confidence bands if one ignores bias. Another direction based
on a supremum penalty for the regression function (Knafl,
Sacks, and Ylvisaker, 1981ab) yields bias-corrected simultaneous
confidence bands; here, bias is accounted for by a bias
correction.

We adapt Wahba (1983b) to the non-i.i.d. case and argue
in an ad hoc fashion that this might have reasonable properties
for our problem. We consider the model

V  y-i,  y  \'(0.ink)  's,.,.'.  ‘  V'.'d.^L). 
with  ^  diagonal.  1  he  posterior  esiimatcu  of  g  Is 
y  =  £(fU)  =  Su'i;.., 

The  covariance  is  derived  in  an  analogous  fashion  as 

C(n(s|X)  =  (/+H,)T„,/(mX)  =  (4.2) 

This  suggests  an  approximate  95^*  confidence  interval  for  g, 

r  l.96fT^\vTj‘(X)  (4.3) 

Now-  suppose,  for  fixed  i.  we  let  "  log(  ^  -  k^,))  and 

approximate  the  covariance  to  first  order. 

o*  =  2rexp(-o,(r^))/>:,  1.  ■  ■  -  ,n. 

The  estimated  confidence  interval  for  o,(/^)  becomes 

i.96\  2/7^,(X,)rexpl-d,^(r^))^) ^  (4.4) 

This approach has some problems, as the solution to the
penalized log likelihood is not the same as the solution to a logit
regression with normal errors. We will pursue this in later work
using ideas of Leonard (1982).

We propose an ad hoc test of the hypothesis of constant
proportionality by computing the difference in deviances between
the smooth and constant estimates,

$$D(i, \lambda) = 2\,[\,L(\hat\alpha_i) - L(\hat\alpha_{i\lambda})\,], \qquad i = 1, \ldots, r, \qquad (4.5)$$

with $\hat\alpha_{i\lambda}(\cdot)$ being the spline estimate of $\alpha_i(\cdot)$ for fixed smoothing
parameter $\lambda$ and $\hat\alpha_i$ the estimate for constant $\alpha_i$. In other
words, $D(i, \lambda)$ is simply the deviance between the constant and
the smoothed logit models. We suppose that this statistic may
have  appioximaielv  a  chi-square  disnihuiion  with  degrees  of  free¬ 
dom  (n-  1)-  iro<c{f~‘  ).  W’c  will  compare  this  with  the  usual 
likelihood  ratio  siaiisiic.  />(/)=  2/,(d, )  with  n~l  degrees  of 
freedom,  m  the  data  analvsis  section. 

Expression  (4,5'>  suggests  examining  the  deviance  contri¬ 
butions  at  !  ((ireen.  1984;  Prepilxm.  1981* 

-  |2),, (logo, ,1-1.,, ((,))!'''■.  (4.()) 

With  the  sign  ifie  same  as  ifiat  of  ~  cxpoi  ,^  (/  ))>  l  ot 

given  1j .  dll'  1'  .ij'i'i  (»\nn.ue!\  thus  laigr  pc'sjiive  ot 

negative  value-  'ugce-i  'lynifKam  de- lation-  Howevei.  the 
graphical  'te'i-'  ai  dilteieut  aie  higlilv  conelated.  and  a 
graphical  ploi  oi  t  vei'u-  U»git  residuals  cannot  be  viewed  as  a 
global  te-t 

5.  Data  .AnaKsis 

We  coii'idei  d.it.i  li(»m  a  lahoraioiv  expennieni  ciMuUicied 
hx  Hogr  i|osii  Ml  wliuti  female  piuaio  leallu'rpeis  ueie  kept  iii 
controlled  lalH>tau*i\  cvuulitions  at  one  of  three  fluctuating  tern 


pcraiure  regimes.  We  hxus  here  onlv  on  ihe  cold  regime.  W'e 
examine  ihe  baseline  lor  Ihe  2.^  iemales  in  this  group  along  with 
Ihe  proportional  icrm  for  luo  of  ihtfse  females.  A  more  compleie 
analysis  is  In  progress  foinilv  wrih  Dawd  Hogg,  Emomology 
Departmeni,  l!W -Madison,  who  kindlx  offered  the  data  lie  col¬ 
lected. 

All  individuals  have  grouped  records,  that  is  counts  of  eggs 
for  1-3  da>  intervals.  Also,  individuals  were  removed  from  Ihe 
study  In  death,  either  natural  or  accidental  (due  to  handling). 
We  assume  that  the  grouping  does  not  introduce  any  bias  in  the 
estimation  of  the  baseline  rate,  and  that  we  are  interested  in  the 
baseline  rate  and  proportionality  terms  at  any  lime  onlt  for  those 
Icafhoppcrs  which  were  alive.  W'e  initially  proceed  as  if  survival 
did  not  affect  bias,  and  later  correct  for  survival  as  indicated  in 
Section  3.3. 

Figure 5.1 shows the baseline rate and the rates for individuals
22 and 23. Note the rise to a fairly constant rate, with gradual
decay. The raw proportionalities for individuals 22 and 23 are
plotted alongside curve estimates with penalties for slope and
for curvature in Figures 5.2-3. The curve estimates based on a
penalty for non-zero slope appear much rougher than the curves
based on the curvature penalty. Approximate 95% pointwise
confidence intervals for the proportionality estimates, based on
the curvature penalty, are shown in Figures 5.4-5.

The likelihood ratio statistics with degrees of freedom and
p-value are shown in Table 5.1. Note the great reduction in
degrees of freedom for the penalized curves, while the deviances
stay fairly high. Figures 5.6-7 show the logit deviances over
time.

Table 5.1  Smooth Deviances

                        Deviance    d.f.     log(X)
#22:
  constant              188.21      68.      X
  m = 1 (slope)         117.6h      14.88    -6
  m = 2 (curvature)     99.4y       8.8.1    -12
#23:
  constant              !1,1.99     64.      X
  m = 1 (slope)         6.V.S^      5.87     -4
  m = 2 (curvature)     62.lo       .1.35    -8

We conclude with curve estimates for the baseline once one
adjusts for the survival process. The figure shows the naive
and adjusted baseline rate estimates for the cold regime. One
sees that survival has little effect on the baseline rate for most
of the experiment, though estimates at the later times can be
affected.

C-LAB, AN INTERACTIVE SYSTEM FOR CLUSTER ANALYSIS


M.B.  SHAPIRO  and  C.D.  KNOTT 


Division  of  Computer  Research  and  Technology,  NIH,  Bethesda,  MD  20205 


The C-LAB system, for doing cluster analysis and related work, is described. It
differs from other cluster analysis systems in that (1) it is interactive, (2) it has
its own built-in language in which user algorithms can be programmed, (3) it contains
many built-in functions for matrix manipulation, numerical analysis and statistics,
and (4) it is display oriented, having commands for producing publication-quality
clustering diagrams.


1.  INTRODUCTION 


C-LAB is an on-line Clustering LABoratory facility that runs on
DECSYSTEM-10 and -20 computers. It consists of a collection of
subroutines which implement many of the most commonly used
techniques in cluster analysis, plus some miscellaneous related
methods. C-LAB is a subset of operators in the MLAB (Modeling
LABoratory) system (Knott 1979), which has its own high-level
language for writing programs. C-LAB differs from other cluster
analysis packages in three main ways: (1) it is interactive,
(2) it has a built-in language (the MLAB language), and (3) it is
display oriented. MLAB provides matrix manipulation and display
facilities and has many built-in functions useful in statistics
and numerical analysis. C-LAB is run on display terminals and,
since it is interactive, results and drawings are seen as they are
computed. MLAB has its own commands for drawing pictures, and
these are supplemented by C-LAB operators for preparing the
output of cluster analysis algorithms for drawing. A user can
program his own algorithms not available in C-LAB. As pointed
out by Anderberg (1973), most cluster analysis methods are
relatively easy to program. Such special algorithms can be
programmed as subroutines (called DO files in MLAB) and invoked
to process specific data.


2.  THE  C-LAB  OPERATORS 


There are many aspects to cluster analysis, including the choice
of data units, variables, clustering criteria, and of what to
cluster, the method of homogenizing variables, the computation of
similarity measures and clustering algorithms, and, finally, the
presentation and interpretation of the results. These aspects
are dealt with by C-LAB as described in the following.


Most C-LAB operators work on a data matrix, where each row
represents a data point (also called a sample or an object) in n
dimensions. The columns are called variables (or features or
attributes). Data to be clustered must have similar scales of
values; C-LAB has two operators for scaling.

There are many measures of similarity (or dissimilarity) between
pairs of objects, the most common being the euclidean distance.
C-LAB has an L_p distance metric built in as a dissimilarity
operator. Other dissimilarity measures are usually easy to
program.
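
As an illustration only, the following Python sketch (not MLAB, and
not the C-LAB source) shows one way an L_p dissimilarity matrix can
be computed from a data matrix. The function names minkowski and
dissimilarity_matrix are hypothetical, and p = 2 is assumed as the
default because it gives the euclidean distance.

```python
def minkowski(x, y, p=2.0):
    """L_p (Minkowski) distance between two points of equal dimension."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def dissimilarity_matrix(rows, p=2.0):
    """Symmetric matrix of pairwise L_p distances between data rows."""
    m = len(rows)
    d = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            # Fill both triangles; the diagonal stays at zero.
            d[i][j] = d[j][i] = minkowski(rows[i], rows[j], p)
    return d


# Three points in the plane, one per row of the data matrix.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(points)
```

Setting p = 1 gives the city-block distance; other measures can be
substituted by replacing minkowski with another two-argument
function.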

The basic idea of cluster analysis is to partition a set of
n-dimensional points representing measurements or descriptive
values of an object (e.g. measurements of different parts of a
plant, or symptoms of a disease) into groups called clusters.
The number of and nature of the clusters may or may not be
specified, and the clusters are to be discovered. Also of
interest are the properties of the points which determine to
which cluster they belong.

The usual paradigm for cluster analysis is to define a similarity
measure or metric, d(x,y), which produces a numerical measure of
how similar the two points x and y are. The choice of such a
metric can be crucial and is, of course, left to the user. Once
the metric is chosen, clusters can be defined in various ways,
based on grouping similar points together.

The main part of C-LAB consists of the operators for doing
clustering. There are operators for each of the three broad
categories of clustering algorithms: hierarchical clustering,
non-hierarchical clustering, and approaches using graph theory.
The hierarchical clustering operators are those for computing and
drawing dendrograms. Clusters are determined by visually
examining the drawing; there is no algorithm in C-LAB for
selecting clusters from dendrograms. Non-hierarchical clustering
is done in C-LAB using a variant of the K-means algorithm:
objects are put into separate clusters, using a minimum variance
optimizing criterion, and information about each cluster is then
printed out, rather than drawn as it is for dendrograms. A graph
theory approach to clustering is implemented in C-LAB through the
minimal spanning tree operator and related operators for
"breaking" certain "inconsistent" tree edges. Clusters are then
defined as the resulting subtrees.

Graphical output is a specialty of MLAB and there are a number of
C-LAB operators used for displaying results as drawings. In
addition to the standard MLAB facilities for drawing graphs there
are C-LAB operators which compute matrices from which
dendrograms, minimal spanning trees, and Chernoff faces (Chernoff
1973) can be drawn. There are two operators for reducing the
dimensionality in a set of data and they can be used to obtain a
plot of the data in 2 dimensions, and in 3 dimensions also, since
there are commands for drawing pictures in 3D.

The C-LAB operators are organized into six categories: scaling,
feature reduction, cluster analysis, output, triangulation, and
miscellaneous. At present the following operators are available:


Scaling                   Feature reduction
  AUTOSCALE                 FISHERRANK
  RANGESCALE                PRCOMP
                            NLM

Cluster analysis          Output
  MST                       CLUSTERINFO
  INCONSISTENT              DENCURVE
  TREECLUSTERS              FACESCURVE
  ALINKAGE                  TREECURVE
  CLINKAGE
  CENTROID                Triangulation
  WARD                      DELAUN
  KMEANS                    DELCURVE
                            VORCURVE
Miscellaneous               VORSTAT
  CLUSTERERROR
  COPHEN
  DISTANCES

Auto-scaling and range-scaling are used for scaling the variables
of a data matrix by normalizing them or by putting them in a 0-1
range. The FISHERRANK operator is the standard Fisher
discriminant ratio, and is used for ranking the variables of a
data set according to their ability to discriminate between known
categories for the data. PRCOMP and NLM perform principal
components and non-linear mapping algorithms, and are used for
reducing the dimensionality of a set of data. The non-linear
mapping algorithm is from Chang and Lee (1973).
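
The two kinds of scaling can be sketched in Python as follows.
This is an illustration under stated assumptions rather than the
C-LAB code: "normalizing" is assumed here to mean subtracting the
mean and dividing by the sample standard deviation, and the
function names autoscale and rangescale are hypothetical
stand-ins for the operators of the same name.

```python
import math


def autoscale(column):
    """Centre a variable on its mean and divide by its sample SD (assumed)."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def rangescale(column):
    """Map a variable linearly onto the 0-1 range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


# One variable (one column of the data matrix).
values = [2.0, 4.0, 6.0]
auto = autoscale(values)
ranged = rangescale(values)
```

Both functions fail on a constant column (zero spread), which a
real implementation would have to handle.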

For the cluster analysis operators, MST, INCONSISTENT, and
TREECLUSTERS are used for computing a minimal spanning tree, then
for finding "inconsistent" edges in that tree, "breaking" them
and determining the resulting clusters. This approach is based
on the work of Zahn (1971). MST, ALINKAGE, CLINKAGE, CENTROID,
and WARD are the operators for computing dendrograms based on
single linkage, average linkage, complete linkage, centroid
linkage, and Ward's method. KMEANS is one of the many variants
of the K-means algorithm, this one taken from Hartigan (1975).
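
The minimal spanning tree approach can be sketched in Python as
below. This is a simplified illustration, not the C-LAB operators:
the tree is built with Prim's algorithm, and an edge is treated as
"inconsistent" when it is longer than a chosen multiple of the
mean tree edge length. That global rule is an assumption made for
brevity; Zahn's criterion compares each edge with nearby edges.
The names mst_edges, tree_clusters, and factor are hypothetical.

```python
def mst_edges(d):
    """Prim's algorithm on a full dissimilarity matrix; returns (i, j, length)."""
    m = len(d)
    in_tree = {0}
    edges = []
    while len(in_tree) < m:
        # Shortest edge joining a tree point to a point not yet in the tree.
        i, j = min(
            ((a, b) for a in in_tree for b in range(m) if b not in in_tree),
            key=lambda e: d[e[0]][e[1]],
        )
        edges.append((i, j, d[i][j]))
        in_tree.add(j)
    return edges


def tree_clusters(m, edges, factor=2.0):
    """Break edges longer than factor * mean length; return the subtrees."""
    mean_len = sum(w for _, _, w in edges) / len(edges)
    parent = list(range(m))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j, w in edges:
        if w <= factor * mean_len:      # keep only the consistent edges
            parent[find(i)] = find(j)
    groups = {}
    for p in range(m):
        groups.setdefault(find(p), []).append(p)
    return sorted(groups.values())


# Six points on a line forming two well separated groups.
xs = [0, 1, 2, 10, 11, 12]
d = [[abs(a - b) for b in xs] for a in xs]
tree = mst_edges(d)
clusters = tree_clusters(len(xs), tree)
```

Here the single long edge joining the two groups is broken, and
the two remaining subtrees are reported as clusters.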

The CLUSTERERROR operator computes the cluster error from a given
clustering solution, such as computed by the K-means operator.
COPHEN is used to compute the correlation between the
dissimilarity matrix for a data set and a dendrogram computed for
the data. (There are operators in MLAB for computing correlation
and covariance matrices.) DISTANCES is an algorithm for
computing Minkowski's distance metric, and is used for creating a
dissimilarity matrix. The euclidean distance metric is most
commonly used.

The output operators compute matrices that contain a summary of
information about clusters (CLUSTERINFO), or matrices used to
draw dendrograms (DENCURVE), Chernoff faces (FACESCURVE), or
minimal spanning trees (TREECURVE). Examples of the use of
DENCURVE and TREECURVE are given below.

In addition to the operators directly related to cluster analysis
there are four used for the triangulation of a set of points in
the plane. The triangulation is done by the DELAUN operator (for
Delaunay triangulation, defined below), the triangulation drawing
by DELCURVE, and the computing and drawing of nearest neighbor
(Voronoi or Dirichlet) regions by VORCURVE; statistics related to
the Voronoi regions are computed by VORSTAT. The triangulation
algorithms are from Lee and Schachter (1980) and Shapiro (1981).

3.  THE  MLAB  LANGUAGE 


The MLAB language is extensive. Only a brief introduction to the
statements and operators that would likely be used by C-LAB users
is given here; however, there are also operators for matrix
manipulation, curve fitting, differential equation solving, and
integration of functions, plus commands for input and output,
including drawing pictures.

Assignment statements are similar to those in other
computationally-oriented languages, having the form

variable-name  =  expression 

where the variable is a scalar or matrix depending on whether the
expression is a scalar or matrix one. MLAB is a higher level
language than FORTRAN or BASIC, and no declaration of variables
is needed. Expressions have the same form as in other high level
languages; for example a root of a quadratic equation would be
expressed as

(-B+SQRT(B^2-4*A*C))/(2*A)

Scalars and matrices are created and manipulated through
assignment statements and through the use of built-in and
user-defined functions. Operators for matrices include the
following:

A & B        Concatenate matrix B below A.
A &' B       Concatenate matrix B to the right of A.
A * B        Ordinary matrix multiplication.
A *' B       Multiply corresponding elements.
A'           "'" indicates the transpose.

Some of the commonly used built-in functions are the following:

F ON 2:9             A column vector of (F(2),F(3),...,F(9)).
POINTS(F,A:B)        A:B in column 1, (F(A),...,F(B)) in column 2.
A:B:C                The values A, A+C, A+2C, ..., B. If C is
                     omitted then C=1 is assumed.
NROWS(X)             The number of rows in matrix X.
NCOLS(X)             The number of columns in matrix X.
READ(DATA,M,N)       Input data from file DATA into an MxN matrix.
SORT(X,C)            Sort matrix X, using column C as the key.
SUM(I,A,B,E)         The sum of expression E for index I running
                     from A to B. Usually E contains index I.
CORR(X)              The correlation matrix for matrix X.
LIST(E1,E2,...,En)   A one-column matrix with n elements.
                     E1,E2,...,En are expressions.
CROSS(1:M,1:N)       An MxN matrix containing (1,1) (1,2) ... (M,N).

Specific rows and/or columns of matrices can be referenced, as in

X[I,J]               The I,J th element of matrix X.
Y ROW 1:5            Rows 1 to 5 of Y. (":" indicates through.)
Z ROW A:B COL C:D    Columns C to D of rows A to B of Z.


Pictures are drawn with the DRAW statement, which specifies a
matrix of coordinates to be drawn in a window. The window
specifies the position of an imaginary box around the data on the
display screen, e.g.

WINDOW W, 10 BY 20, AT 0,0

indicates that window W is 10 data units by 20 data units, with
the lower left of the screen having coordinates 0,0. Thus the
point (5,10) would be plotted in the middle of the screen. The
STRING statement is used to display characters, as in the
following:

STRING "ABC" IN W, AT 5,10      ABC is drawn starting at 5,10.
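
The effect of the WINDOW statement on where a point lands can be
sketched in Python as follows. This only illustrates the mapping
described in the text and is not how MLAB is implemented; the
function name to_screen and its parameters are hypothetical, and
screen positions are expressed as fractions of the display width
and height.

```python
def to_screen(x, y, width, height, origin=(0.0, 0.0)):
    """Map data coordinates to fractions of the window.

    (0, 0) is the lower left corner and (1, 1) the upper right corner.
    """
    ox, oy = origin
    return ((x - ox) / width, (y - oy) / height)


# A window of 10 by 20 data units with its lower left corner at 0,0.
middle = to_screen(5, 10, width=10, height=20)
corner = to_screen(10, 20, width=10, height=20)
```

The point (5,10) maps to (0.5, 0.5), the middle of the screen, as
stated above.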


The DRAW statement has a number of options. Some of those that
are used with C-LAB are illustrated in the following, where Z is
a 2 column matrix of (x,y) coordinates and W is a window as
described above.

DRAW Z IN W, LINE 1     The points are connected by a solid line.

DRAW Z IN W, LINE 0, LABEL WITH 1:NROWS(Z)
                        The points in Z are labeled with
                        consecutive integers and the points are
                        not connected (LINE 0).

DRAW Z IN W, LINE 6     Line type 6 specifies lifting the pen
                        between curve segments.


4.  EXAMPLES 


Five  examples  are  given  here  to  illustrate  the 
type  of  programming  and  picture  drawing 
associated  with  C-LAB.  As  can  be  seen,  quite  a 
bit  is  accomplished  in  a  few  statements. 


The essential ingredients in most MLAB programs are the function
statements. Functions are defined as in the following:

FCT K(X) = A*X^2+B*X+C          A quadratic function.
FCT G(T) = A*EXP(-B*H(T))       H is a previously defined function.
FUNCTION MAX(A,B) = IF A<B THEN B ELSE A
                                Max of A and B.

Functions are computed using the ON and POINTS operators, as in
F ON 2:9 and POINTS(F,A:B).


4.1 Jaccard's Coefficient


For presence-absence data, association coefficients are used for
similarity measures. Jaccard's coefficient is illustrated here.
For two m-vectors X and Y containing 0 and 1 values, J is
computed as a/(a+b+c), where

a = the number of places where both X and Y are 1
b = the number of places where X=1 and Y=0
c = the number of places where X=0 and Y=1.

FCT A(X,Y) = SUM(I,1,M,X[I] AND Y[I])
FCT B(X,Y) = SUM(I,1,M,X[I] AND NOT Y[I])
FCT C(X,Y) = SUM(I,1,M,NOT X[I] AND Y[I])
X = READ(DATA1,M); Y = READ(DATA2,M)
AA = A(X,Y)   "compute a"
J = AA/(AA+B(X,Y)+C(X,Y))
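
For readers without access to MLAB, the same computation can be
sketched in Python. This mirrors the definitions of a, b, and c
given above; the function name jaccard and the example vectors are
hypothetical.

```python
def jaccard(x, y):
    """Jaccard's coefficient a/(a+b+c) for two 0/1 vectors of equal length."""
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)        # both are 1
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)    # X=1 and Y=0
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)    # X=0 and Y=1
    return a / (a + b + c)


x = [1, 1, 0, 0, 1]
y = [1, 0, 1, 0, 1]
j = jaccard(x, y)   # a = 2, b = 1, c = 1
```

Places where both vectors are 0 do not enter the coefficient. The
sketch does not guard against the case a+b+c = 0, where the
coefficient is undefined.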


4.2 Finding Nearest Neighbors


The distances shown at the left of the dendrogram are drawn
separately. They range from 0 to the maximum value found in
(column 3 of) matrix A. 900 is used here. The X coordinate for
these numbers in the window is .02, and the Y coordinates go from
.05 to .95. The numbers are drawn as follows:

L = .02 &' (.05:.95:.3)
DRAW L IN W, LINE 0, LABEL WITH 900:300:-300


The euclidean distances (squared) of point Q (nx1) to each of the
points in mxn matrix X are computed and sorted and put into
matrix IX, which also contains the corresponding indices in
column 2. Thus, after the code below is executed, the index of
the point closest to Q is in IX[1,2] and the distance of it to Q
is in IX[1,1].

FUNCTION DIST(I) = SUM(J,1,N,(X[I,J]-Q[J])^2)
D = DIST ON 1:M
IX = SORT(D &' 1:M,1)
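
The same idea can be sketched in Python: compute the squared
distance from Q to every row, pair each distance with its row
index, and sort on the distance. The function name
nearest_neighbors is hypothetical, and row indices are 1-based to
match the description above.

```python
def nearest_neighbors(X, q):
    """Return (squared distance, 1-based row index) pairs, nearest first."""
    pairs = [
        (sum((xij - qj) ** 2 for xij, qj in zip(row, q)), i)
        for i, row in enumerate(X, start=1)
    ]
    return sorted(pairs)


X = [(0, 0), (3, 4), (1, 1)]
Q = (1, 2)
IX = nearest_neighbors(X, Q)
nearest_index = IX[0][1]      # plays the role of IX[1,2] in the text
nearest_distance = IX[0][0]   # plays the role of IX[1,1] in the text
```

Sorting on squared distances gives the same ordering as sorting
on the distances themselves, so the square root can be skipped.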


4.3  Drawing  a  Dendrogram 


In this example an average linkage dendrogram is drawn. First
the steps are explained, then the C-LAB statements for executing
the steps are given. (The other linkages would be done
similarly, changing only step 4.) The dendrogram is shown in
Figure 1.


(1)  Create a 1x1 window in which the dendrogram is to be drawn.

(2)  Input the mxn data into matrix X, in this case 20x5.

(3)  Compute a dissimilarity matrix D for the data.

(4)  Compute (m-1)x3 matrix A defining the dendrogram.

(5)  Use A to create matrix Q with coordinates (in columns 1
     and 2) and labels (in column 3) for drawing the dendrogram.

(6)  Draw the dendrogram, using columns 1 and 2 of Q to draw the
     lines and column 3 for the labels at the top.

WINDOW W, 1 BY 1, AT 0,0
X = READ(DATA,20,5)
D = DISTANCES(X)
A = ALINKAGE(D)
Q = DENCURVE(A)
DRAW Q COL 1:2 IN W, LINE 6, LABEL WITH Q COL 3


Figure  1:  Average  Linkage  dendrogram  for  20x5 
data 
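
Step (4) can be illustrated with a small Python sketch of
average linkage clustering. It is not the ALINKAGE operator: the
text says only that A is an (m-1)x3 matrix, so the row layout used
here (the two clusters merged and the distance at which they
merge) is an assumption, as are the function name average_linkage
and the convention that merged clusters receive new ids starting
at m.

```python
def average_linkage(d):
    """Agglomerative average linkage on a dissimilarity matrix.

    Returns m-1 rows (id_a, id_b, distance). Original points have ids
    0..m-1; each merged cluster receives the next unused id.
    """
    m = len(d)
    clusters = {i: [i] for i in range(m)}   # cluster id -> member points
    merges = []
    next_id = m
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for k, a in enumerate(ids):
            for b in ids[k + 1:]:
                pa, pb = clusters[a], clusters[b]
                # Mean of all between-cluster point distances.
                avg = sum(d[i][j] for i in pa for j in pb) / (len(pa) * len(pb))
                if best is None or avg < best[2]:
                    best = (a, b, avg)
        a, b, avg = best
        merges.append(best)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Three points on a line at 0, 1, and 5.
d = [[0, 1, 5],
     [1, 0, 4],
     [5, 4, 0]]
A = average_linkage(d)
```

Points 0 and 1 merge first at distance 1, and the resulting
cluster joins point 2 at the average distance (5+4)/2 = 4.5.
Replacing the mean by a minimum or maximum gives single or
complete linkage.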


4.4  Drawing  a  Minimal  Spanning  Tree  on  a 
Non-linear  Map 


The NLM (non-linear mapping) operator is used to project the
points in some higher dimension to 2 dimensions, preserving the
interpoint distance relationships as much as possible. There is
inevitably some distortion, and one way of assessing it is to
superimpose a minimal spanning tree, with edge length labels, on
the non-linear map, since the edges in the tree represent nearest
neighbor connections. This type of combination was suggested by
Kruskal (1977). It is easily done in C-LAB, as is described in
the following steps. The C-LAB statements are shown at the end.
Figure 2 shows a tight cluster containing points 1 to 10 and a
loose cluster of points 11 to 20, and indicates that the
non-linear map represents the data well.

(1)  The algorithm starts with the mapped points in the 0-1
     range, then on each iteration the points can move out of
     that range. Therefore a window is set up


(5)  Draw the Voronoi diagram. Line type 6 is used to lift the
     pen between segments of the diagram.

(6)  Label the points.

WINDOW W, 1 BY 1, AT 0,0
X = READ(DATA,16,2)
D = DELAUN(X)
V = VORCURVE(X,D)
DRAW V IN W, LINE 6
DRAW X IN W, LINE 0, LABEL WITH 1:16


Figure 3: Voronoi diagram for 16 points in the plane


5. DISCUSSION


Blashfield et al. (1982) present a good summary of the current
state of cluster analysis software. They divide such software
into 3 categories and include C-LAB with the cluster analysis
packages, the main ones being CLUSTAN (Wishart 1978), NT-SYS
(Rohlf, Kishpaugh, and Kirk 197^), and CLUS (Rubin and Friedman
1967). (There is one mistake made in describing C-LAB: it has
5, not 3, linkage methods.) C-LAB does fit in that category more
than in any other, but it has some unique features that separate
it from all the other cluster analysis software discussed. These
are that it is used interactively, has a built-in, high level
language, and is oriented around graphics terminals, having
built-in capabilities for drawing high quality pictures.
However, C-LAB does not have the extensive set of clustering
commands available in some other packages, notably CLUSTAN and
NT-SYS. This is somewhat overcome by the fact that the C-LAB
user can in many cases program his own special algorithms. Thus,
whereas C-LAB has only one operator (DISTANCES) for computing
dissimilarity values, it is usually easy to program others, as
illustrated above for the Jaccard coefficient. This language
feature can be considered a plus, but it also means that a
beginner would have more trouble than with, say, CLUSTAN, which
has 38 different similarity measures available.

Being interactive, C-LAB is designed to be used differently than
other cluster analysis systems. Rather than the user having to
know beforehand the exact series of computations to be done,
succeeding steps are based on current results. The value of this
feature depends on the particular work being done.

The usefulness of the graphical capabilities of the C-LAB
language can be attested by the fact that most of the techniques
found in Everitt (1978) for displaying multivariate data are
either already available as C-LAB operators or are easily
programmed. The former include the operators PRCOMP (principal
components analysis), NLM (non-linear mapping), MST, CLINKAGE,
ALINKAGE, CENTROID, and WARD (hierarchical clustering), and
FACESCURVE (Chernoff faces). The latter include probability
plots (Person 1975), Andrews plots (1972), and biplots (Gabriel
1971).

Copies  of  the  system  documentation  and  the 
MLAB/C-LAB  program  are  available  by  writing  the 
authors. 

SCATTERPLOT MATRIX TECHNIQUES FOR LARGE N


D.  B.  Carr,  R.  J.  Littlefield,  and  W.  L.  Nicholson 


Pacific  Northwest  Laboratory 
Richland,  WA  99352 

High-performance  interaction  with  scatterplot  matrices  is  a  powerful  approach  to 
exploratory  multivariate  data  analysis.  For  a  small  number  of  data  points,  real-time 
interaction  is  possible  and  overplotting  is  usually  not  a  major  problem.  However, 
when  the  number  of  plotted  points  is  large,  display  techniques  that  deal  with 
overplotting  and  slow  production  are  important.  This  paper  addresses  these  two
problems  in  the  context  of  display  devices  that  have  a  color  look-up  table.  Topics 
include  compromised  brushing,  film  loops,  and  density  representation  by  gray-scale 
or  by  symbol  area.  The  paper  also  discusses  techniques  that  are  generally 
applicable,  including  interactive  graphical  subset  selection  from  any  collection  of 
scatterplots,  and  comparison  of  scatterplot  matrices. 


1.  INTRODUCTION 

A scatterplot matrix for p-variate data is the ordered display of
p*(p-1) scatterplots as shown below in Exhibit 1. Since 1980,
many descriptions of scatterplot matrices have appeared in the
statistical graphics literature [1,2,3,4,5,6]. With different
names and modest variations, the important themes prevail. The
two themes are 1) scatterplot matrices provide an effective
approach to exploratory multivariate data analysis and 2)
scatterplot matrices can be enhanced to provide more information.
Undoubtedly, scatterplot matrices and a variety of enhancement
procedures, including transformations, smoothings, missing data
representation, and interactive subset selection and
representation, will find their way into an increasing number of
statistical packages and into common use. The purpose of this
paper, which is a sequel to [5], is to elaborate on interaction,
density representation and display techniques that are helpful in
representing a large number of points, and to exploit the color
look-up table on color raster display devices.

Exhibit 1: Scatterplot collection. Data in the scatterplot matrix are multiple measurements on
individual rain samples collected at nine sites in the ADS (Acid Deposition System) network [7].
Nitrate and Sulfate measurements are ion concentrations expressed in logarithms of micro-moles per
liter. Depth is rain gage depth in logarithms of millimeters. Two additional plots show site
location in degrees and collection dates in decimal year. With 4109 points (minus some missing
data) the overplotting is substantial.

An objective of "new" graphical techniques is to make the
discovery of significant patterns in data easier and more likely.
Once a pattern is found, analyst ingenuity can typically produce
an alternative display that shows the same pattern and meets with
publication regulations. Thus, readers of publications often
have little exposure to techniques that are particularly
effective in the interactive exploratory setting. Until
electronic journals become available, the gap between what is
useful in exploratory data analysis and what can be portrayed in
static monochrome journals will remain. In this paper, the
importance of the scatterplot matrix and graphical interaction is
assumed. Little space is devoted to description and
interpretation of data to prove that patterns were found that
could not have been found any other way. Thus, what is shown
here does not convey the speed or power of the scatterplot matrix
for finding significant patterns.

2. LARGE N

What is large depends on the frame of reference. If available
plotting space for a scatterplot is a one inch square, 500 points
can seem large. For our purposes, N is large if plotting time is
much greater than real time, if straightforward plots can have an
extensive amount of overplotting, or if computation times are
long. Exhibit 1 provides an example. If there were no missing
data, each scatterplot would contain 4109 points. With fourteen
plots, the total number of points in the display exceeds 50000.
Currently, few if any display devices can display this number of
points in real time (in a fraction of a second). Thus, with
commonly available display software/hardware, 50000 is large.
The exhibit also fits the other definitions of large.
Substantial overplotting can be inferred from the plotted area,
the dot size and the number of points, 4109. Computation times
are also long for some operations. For instance, obtaining
graphically specified subsets from 4109 points does not take a
lot of time, but obtaining lower, middle, and upper smooths [8],
say using LOWESS [9], does. Thus, the data set for Exhibit 1
qualifies as large under all three counts. Because display speed
and computing capabilities can be expected to improve
dramatically, the key definition of large concerns overplotting.

Large data sets are common. Many monitoring studies generate
large quantities of data. In a substantial subclass of such
studies, the same measurements on variables are obtained at
different times and/or at different spatial locations. Large
data sets then arise from pooling data across temporal and
spatial indices. Exhibit 1 shows data from 9 sites in just one
of several acid rain deposition monitoring networks. Other
monitoring examples include seismic networks, multispectral
satellite images, and so on.

From visual appearances, Exhibit 1 might more appropriately be
called a scatterplot collection. That is, Exhibit 1 illustrates
plots in addition to a scatterplot matrix. The indexing
parameters of site location and time are shown in separate plots.
An addition would be an underlayed map in the site locations
plot. Whatever the layout, there are two key concepts: (1) the
background data structure is an N x P matrix of data; and (2)
interactive graphical subset selection can be driven from any
plot.

3.  INTERACTIVE  GRAPHICAL  SUBSET  SELECTION  AND 
REPRESENTATION 

One of the most powerful enhancement procedures for scatterplot
matrices is to distinguish subsets of data for comparison against
each other or against the whole set. Interactive graphical
subset selection is particularly convenient. Four approaches to
graphical subset selection have been described in the literature.
The first [10] involves picking a point in a plot, and having the
subset include the k nearest neighbors. The second approach
defines the subset by specifying a rectangular region which
contains the subset. The third approach [5,11] involves drawing
a polygon around desired points. The fourth approach [6], called
"brushing", uses an interactively specified rectangular region
that can be swept through the plot to define arbitrarily shaped
regions. Points falling in the region are in the subset.

Omitting the nearest neighbor approach, which is largely
algorithmic, brushing is the most convenient form of subset
selection and encompasses the other approaches. Brushing is
particularly advantageous when selected points are distinguished
in real time. However, when the number of points gets large,
significant time is required to find and redisplay all points
falling within the sweep of the brush, and the display lags
behind the brush. With polygon selection, defining lines can be
drawn in real time. For finding points, computations are reduced
since a simple boundary is involved, as opposed to a long
sequence of rectangles generated by brushing. For storage
purposes the polygon definition is more compact than a long
sequence of rectangles or a vector indicating which points were
chosen. In addition, the polygon definition can be readily
modified when applied to revised data bases. Consequently, for
large data sets, polygon selection is the method of greater
practicality.
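
Polygon selection can be sketched in Python with a standard
ray-casting (even-odd) point-in-polygon test. This is an
illustration of the idea rather than the authors' implementation;
the function names in_polygon and select are hypothetical, and
points lying exactly on the boundary are not treated specially.

```python
def in_polygon(x, y, polygon):
    """Even-odd test: count polygon edges crossed by a ray going right."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):            # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:                 # crossing lies to the right
                inside = not inside
    return inside


def select(points, polygon):
    """Indices of the plotted points that fall inside the polygon."""
    return [i for i, (x, y) in enumerate(points) if in_polygon(x, y, polygon)]


# The stored selection is just the polygon's vertices.
triangle = [(0, 0), (4, 0), (0, 4)]
points = [(1, 1), (3, 3), (0.5, 2)]
chosen = select(points, triangle)
```

Only the vertex list needs to be kept, which is the compactness
noted above, and the same list can be reapplied when the data are
revised.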

Color  is  generally  accepted  as  a  good  method 
for  distinguishing  a  small  number  of  subsets. 
When  display  speed  is  a  problem,  as  in 
brushing,  approaches  can  be  taken  that  lead  to 
compromised  but  rapidly  produced  displays.  The 
underlying  trick  is  to  redisplay  only  selected 
points  and  leave  remaining  points  alone.  In  a 
monochrome  setting,  two  types  of  dots  can  be 
used  as  shown  in  the  top  row  of  Exhibit  2  [6].


Exhibit  2:  "Exclusive  Or"  Dots.  The  top  row
shows  two  dot  types,  filled  and  open.  The 
second  row  shows  two  disjoint  filled  dots. 
The  third  row  shows  partially  overplotted 
filled  dots.  The  invisible  fourth  row 
corresponds  to  any  even  number  of  identically 
positioned  dots  of  the  same  state. 


Filled  dots  represent  points  in  the  selected 
set  and  open  dots  represent  points  in  the 
complement  set.  Writing  on  the  bits  in  the 
central portion of each dot using the
"exclusive or" operator causes the dot to
switch its filled/open state. The speed is
obtained at a price. Consider the two filled
dots in each of rows two, three and four in
Exhibit 2. The second row is fine. In the
third row dots are partially overwritten, so
what is visible has a different shape. The
invisible  fourth  row  shows  the  blank  created  by 
perfect  overwriting  of  any  two  or  any  even 
number  of  dots  of  the  same  state.  The 
"exclusive  or"  approach  is  not  desirable  for 
large  N  problems  since  the  approach  requires 
large  dots  with  visible  interiors  and 
misrepresents overplotted points. Color
displays provide more alternatives.
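
The behavior of "exclusive or" dots can be reproduced on a toy bitmap. The sketch below is only an illustration under assumed 3 x 3 dot patterns and a list-of-lists screen; it is not the display code used for Exhibit 2.

```python
def blank(width, height):
    """Create an all-zero bitmap stored as a list of rows."""
    return [[0] * width for _ in range(height)]


def xor_write(screen, pattern, row, col):
    """XOR a small bit pattern into the screen at (row, col)."""
    for r, bits in enumerate(pattern):
        for c, bit in enumerate(bits):
            screen[row + r][col + c] ^= bit


# Assumed 3 x 3 patterns: a filled dot and the central bits that toggle it.
FILLED = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
CENTER = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

# Toggling the central bits switches a dot between filled and open.
screen = blank(3, 3)
xor_write(screen, FILLED, 0, 0)
xor_write(screen, CENTER, 0, 0)
open_dot = [row[:] for row in screen]

# Two filled dots that partially overlap lose the shared column.
partial = blank(5, 3)
xor_write(partial, FILLED, 0, 0)
xor_write(partial, FILLED, 0, 2)

# Two identically positioned filled dots cancel completely.
double = blank(3, 3)
xor_write(double, FILLED, 0, 0)
xor_write(double, FILLED, 0, 0)
```

The last two cases correspond to the third and fourth rows of Exhibit 2: partial overplotting changes the visible shape, and perfect overplotting of an even number of dots leaves a blank.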

Color  raster  display  devices  with  a  color 
look-up  table  allow  the  definition  of  multiple 
plotting  surfaces  with  control  over  their 
priority  (color  overplotting/mixing  control) 
and  visibility.  To  rapidly  display  a  chosen 
subset,  the  selected  subset  can  be  written  in  a 
higher  priority  (color  overwrite)  surface. 
This  is  also  a  compromise.  A  better  display 
would  also  distinguish  regions  by  a  third 
color  when  selected  points  overplot  points  in 
the  complement  set.  This  three-color  plot 
unfortunately  requires  replotting  all  data. 
One  approach  is  to  work  initially  with  the 
fast  display.  The  color  mixture  version  can  be 
written in hidden surfaces as a background
process.  When  the  color  mixture  version  is 
complete,  it  can  then  be  substituted  for  the 
approximate  version.  Erasure  or  removal  of  a 
small  chosen  subset  can  also  be  handled  by 
plotting  with  the  background  color  in  a  higher 
priority  surface.  Unfortunately,  when 
overplotting between the chosen set and its
complement  is  substantial,  such  plots  become 
unacceptable.  The  only  recourse  seems  to  be 
direct  plotting  of  the  complement  set.  Thus, 
even color devices do not solve all the speed
bottlenecks of brushing.

For  study  and  comparison  purposes  the 
simultaneous  representation  of  more  than  one 
set becomes desirable. Exhibits 3a and 3b show
an example. After noting the similarity of the
X1 versus X6 and X1 versus X7 plots in Exhibit
3a,  a  pencil  shaped  region  was  selected  in  each 
of  the  two  plots.  The  two  selected  sets  are 
shown  by  using  open  circles  and  filled  squares 
in Exhibit 3b. Disjoint sets and apparent
symmetry in the X6 versus X7 plot came as a
surprise. The example suggests that the subset
selection tool should be in the hands of those
who understand the particle physics experiment.
Then symmetry would probably be taken as given
and more subtle patterns would be of interest.

Color can be used to handle more types of
overplotting. Suppose two sets A and B have
been graphically specified as in Exhibit 3b.
Depending on the specification, the intersection,
denoted A·B, may not be null. This
creates 4 sets of points, A·~B, B·~A, A·B and
~A·~B (where ~ denotes the complement), and
potentially 11 types (6 pairs, 4
triples and 1 quadruple) of overplotting.
Since eleven colors are too many to
distinguish rapidly, we chose the colors in
Exhibit 4 to represent subsets and
overplotting.
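
The count of 11 overplotting types follows from choosing two or more of the four disjoint point sets. A short check, with descriptive labels that are assumptions for this example only:

```python
from itertools import combinations

# The four disjoint sets induced by two selections A and B.
subsets = ["A only", "B only", "A and B", "neither"]

# Overplotting involves at least two different sets at one location.
overplot_types = {
    k: list(combinations(subsets, k)) for k in range(2, len(subsets) + 1)
}
counts = {k: len(v) for k, v in overplot_types.items()}
total = sum(counts.values())
```

Here `counts` holds the number of pairs, triples and quadruples, and `total` is their sum.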

The  separation  of  the  two  sets  is  then  shown  as 
the  absence  of  yellow.  In  contrast  to  Exhibit 
3b,  the  relationship  of  two  sets  to  the  rest  of 
the data is conveyed in a single picture.


Exhibit 3a: Particle physics data scatterplot
matrix. The four variables (out of seven)
partially describe individual replications of a
high energy particle physics scattering
experiment. [1?] Units have been altered by
taking logarithms of absolute values. Note the
similarity of the two rightmost plots in the
top row.


Exhibit 4: Colors for Subsets and Overplotting


When brushing is impractical and several subsets
are to be compared, another approach is
available. Each subset can be written in a
different plotting surface. Then surfaces are
cycled by changing the color look-up table.
This film loop approach is described in more
detail in [13,14]. In general, sequences of
views  are  easier  to  follow  if  there  is 
continuity  between  views.  In  the  subset 
context,  putting  a  composite  view  between  each 
of  the  subset  views  facilitates  comparison. 


4.  DENSITY  REPRESENTATIONS 

Section 3 discussed overplotting for different
subsets.  No  distinction  was  made  if  a 
displayed  point  came  from  one  data  point  or 
10000  data  points.  This  section  addresses 
overplotting  of  multiple  points  from  the  same 
set. 

The  three  basic  strategies  in  dealing  with 
overplotting  are  1)  to  plot  open  circular 
symbols  2)  to  alter  the  data  to  reduce 
overplotting  and  3)  to  represent  the  point 
density.  Plotting  open  circular  symbols 
[8]  and  jittering  the  data  [3]  are  helpful 
techniques  for  small  data  sets,  but  are 
inadequate for large N plots. To represent
a  large  number  of  points,  some  form  of  density 
representation  is  required. 

In  representing  bivariate  densities  a  common 
approach  is  to  bin  the  data  and  to  indicate  the 
bin  counts.  For  printer  plots  symbols  such  as 
those  in  Exhibit  5  are  often  used  with 
rectangular  binning  regions  that  correspond  to 
space  allocated  for  line  printer  characters. 


Exhibit 3b: Two subsets. Pencil shaped sets
were selected in each of the two rightmost
plots in the top row. Open circles and filled
squares show the two sets. The two sets have
no elements in common and the X6 versus X7 plot
shows symmetry.


Exhibit 5: Plotter Symbols Representing Counts


Since the amount of ink used is only remotely
related to the density, a "simple" visual
process cannot be used in assessing and
comparing local densities. When the goal is
visual assessment of local density, two
approaches are available: either symbol
intensity represents the count, or symbol area
represents the count.

4.1  Interactive  Gray  Scale  Density 
Representations 

Using gray-scale intensity to show counts is a
common practice in the field of image processing
[15,16], which offers the opportunity for
real time exploration based on data density.
The process starts by binning data (see Exhibit
6a) into, say, a 256 X 256 matrix. Then the
density estimate is often smoothed using fast
algorithms such as the shifted histogram [16,
17,18] or the Fast Fourier Transform (FFT).
Next, the image is written into display device
memory. This involves assigning elements of
the matrix to specific pixels and the
transformation of the estimated density to discrete
pixel values. The correspondence from pixel
values (density) to gray scale can be
manipulated in real time via the color look-up
table. Exhibit 6b shows an image corresponding
to a different transfer function between the
pixel values (horizontal axis) and the gray
scale (vertical axis). The menu at right
provides options for altering the transfer
function and for contouring based on
interactively chosen density levels. A fast
way to find the density levels is to create a
spike transfer function (Exhibit 6c) that can
be moved left and right with a mouse. The
corresponding rough contours are shown in real
time.
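
The steps above (bin, convert to pixel values, then vary only the look-up table) can be sketched as follows. This is a simplified illustration, not the S functions described below: smoothing is omitted, a 4 x 4 matrix stands in for 256 x 256, and all function names are assumptions.

```python
def bin_counts(xs, ys, nbins):
    """Bin points with coordinates in [0, 1] into an nbins x nbins matrix."""
    counts = [[0] * nbins for _ in range(nbins)]
    for x, y in zip(xs, ys):
        i = min(int(y * nbins), nbins - 1)
        j = min(int(x * nbins), nbins - 1)
        counts[i][j] += 1
    return counts


def to_pixel_values(counts, levels=256):
    """Linearly map counts to integer pixel values 0..levels-1."""
    top = max(max(row) for row in counts)
    if top == 0:
        return [[0] * len(row) for row in counts]
    return [[(c * (levels - 1)) // top for c in row] for row in counts]


def spike_table(center, half_width, levels=256):
    """Look-up table that lights only pixel values near `center`."""
    return [levels - 1 if abs(v - center) <= half_width else 0
            for v in range(levels)]


def apply_table(pixels, table):
    """Simulate the display hardware: gray level = table[pixel value]."""
    return [[table[p] for p in row] for row in pixels]


xs = [0.1, 0.1, 0.1, 0.6, 0.9, 0.9]
ys = [0.1, 0.1, 0.1, 0.6, 0.1, 0.1]
pixels = to_pixel_values(bin_counts(xs, ys, 4))

# Moving the spike changes the table only; pixel memory is untouched.
contour = apply_table(pixels, spike_table(center=170, half_width=10))
lit = [(i, j) for i, row in enumerate(contour)
       for j, g in enumerate(row) if g > 0]
```

Because only the table of 256 entries changes when the spike moves, the redisplay cost does not depend on the number of data points.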


Exhibit 6a: Binned Particle Physics Data. The
data are described in Exhibit 3a. Here, the
first two variables have been binned into a 256
X 256 matrix.


Exhibit 6b: A Gray-Scale Density Representation.
The transfer function at the bottom
defines the correspondence between density and
the gray level for each pixel. The transfer
function can be interactively manipulated to
call attention to different density regions.


Exhibit 6c: "Empirical" Contours. Movement of
a spike transfer function allows real-time
investigation of different "empirical"
contours.


In  our  implementation,  the  density  estimation 
and  interactive  transfer  function  manipulation 
routines  were  added  as  functions  to  S,  a 
statistical package from AT&T Bell
Laboratories. [19] Chosen contour levels are easily
passed on to a contouring routine in S. Other
variations of the transfer function can be used
to call attention to high- or low-density
regions.  Gray  scale  density  images  can  also  be 
examined  rapidly  for  scatterplot  collections. 

When  the  above  process  is  applied  to  large  N 
problems,  little  changes  except  that  binning 
takes  longer  and  density  estimates  often  have 
a  few  extreme  values.  When  the  range  of  pixel 
values  is  limited  to  say  256,  a  linear  mapping 
from  the  density  estimates  results  in  low 
gray-scale  resolution  for  low  density  regions. 
S  can  be  used  to  provide  square  root  or  other 
transformation before the density is
transferred to display device memory. A useful
approach in density representation is to treat
densities above (or below) a specified value
the same. We call this "blunting", and provide
it through the menu under the item rescaling Z.
Thus  gray-scale  methods  are  well  suited  for 
handling  large  sample  sizes. 
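
A small numerical example shows why a linear mapping loses resolution at low densities and how a square root transformation or blunting helps. The density values and function names below are assumptions chosen for illustration.

```python
import math


def blunt(value, low=None, high=None):
    """Treat all densities below `low` (or above `high`) the same."""
    if low is not None and value < low:
        return low
    if high is not None and value > high:
        return high
    return value


def to_pixel(value, vmax, levels=256, transform=lambda v: v):
    """Map a density to a pixel value 0..levels-1 after a transformation."""
    return round(transform(value) / transform(vmax) * (levels - 1))


# Hypothetical density estimates with one extreme value.
densities = [1, 4, 9, 10000]

linear = [to_pixel(d, 10000) for d in densities]
root = [to_pixel(d, 10000, transform=math.sqrt) for d in densities]
blunted = [to_pixel(blunt(d, high=9), 9) for d in densities]
```

With the linear mapping the three low densities all fall on pixel value 0, the square root separates them slightly, and blunting everything above 9 spreads them over the full range.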

4.2  DENSITY  REPRESENTATION  BY  SYMBOL  AREA 

The  use  of  area  is  another  choice  for  direct 
visual  representation  of  the  local  density. 
Area  is  not  perceived  as  accurately  as  several 
other  visual  variables  [20]  but  in  this  context 
area  provides  a  reasonable  choice.  The 
technique of representing density with area can
be  found  in  various  guises  in  the  literature. 
One  variant  [3,8]  is  shown  in  Exhibit  7a.  In 
this  variant,  the  binning  region  is  a  square 
and  the  symbol  is  a  sunflower.  The  number  of 
petals  of  the  sunflower  indicates  the  number  of 
points  in  the  region.  Except  for  regions  with 
a  single  point  which  are  represented  with  a 
dot, and except for overplotting of line
segments, the amount of ink used (plotting
area)  by  the  symbol  is  proportional  to  the 
counts.  When  such  a  plot  is  compared  to 
Exhibit  7b  which  uses  hexagonal  bins  or  to 
original data in Exhibit 7c, point locations in
the sunflower plot with square bins appear
stretched out in the vertical and horizontal
directions. The hexagon bins seem to represent
the data more faithfully. Bin shape is a type
of two dimensional smoothing parameter. The
density  estimate  bias  reduction  for  using 
hexagons  instead  of  squares  is  approximately  4 
percent. [21]  This  would  not  account  for  the 
large  visual  discrepancy  in  Exhibits  7a  and  7b. 
The  exact  placement  of  the  binning  lattices  can 
make  a  difference.  However,  the  discrepancy 
most likely results from emphasis of
human-preferred visual directions, horizontal and
vertical, by square bins and round symbols
within the bins. For this reason, we prefer
hexagon  bins.  The  additional  cost  for  hexagon 
binning  is  small  as  can  be  seen  by  the 
algorithm  in  the  appendix.  Given  that  hexagon 
bins are to be used, the next question concerns
the symbol. We prefer a filled hexagon whose
area is proportional to the count, as shown in
Exhibit 7d. This provides a general impression
of density. Some may complain that the exact
count has been lost. In an interactive
environment, if one really wants to know the
exact count, the best procedure is graphical
selection of the desired area and a query to
the computer, "how many?". Exhibit 7d differs
from other area representations in that points
are shown exactly when 3 or fewer points fall
in a region. A more difficult variant is to
plot each hexagon symbol as close as possible
to the center of mass within each hexagonal bin
while keeping the symbol completely inside the
bin. Since even the hexagon lattice structure
can be distracting, approaches that break it up
are worth considering. Exhibit 8 shows a hexagon
density representation of the original data
plotted in Exhibit 1. Note that the number of
symbols actually plotted is much less than that
in the original. Thus, this display can be
produced on a pen plotter. Of course the most
important aspect is that regions of high
density are now evident. When hexagons are
written into display device memory, pixel
values can be assigned that correspond to count
intervals. The color look-up table can be used
to alter the displayed intensity in real time,
just as in the gray-scale discussion.
Alternatively, a few easily distinguished
colors can be used to call out selected density
intervals. [15] When the number of points is
large, density scaling becomes an issue. The
maximum hexagonal area displayed corresponds to
the largest count and just fills its bin. With
this fixed point and the area being
proportional to the count, low density symbols
become smaller than device resolution. In such
cases a single pixel can be shown as in Exhibit
8. As in Section 4.1, this identical treatment
for a range of densities is called blunting.
The blunting of high densities is also useful
as are other transformations that emphasize
selected portions of densities, for example low
count regions. For binned data, transforming
counts and redisplaying them is a rapid
operation. Other procedures such as smoothing
can be adapted to binned data with great
computational savings and little loss of
accuracy.


Exhibit 7a: Sunflowers in square bins. Data
are paired sodium and chloride ion
concentration measurements in logarithms of
micro-moles per liter from individual rain
samples collected at Acid Deposition Site 152A,
Indian River, Delaware. [7] Note the visual
impact in horizontal and vertical directions.


Exhibit 7b: Sunflowers in hexagon bins. The
binning lattice still detracts, but the plot
looks closer to that in Exhibit 7c.


Exhibit 7c: Original data. Actual overplotting
is not substantial.


Exhibit 7d: Hexagon symbols in hexagon bins.
The hexagon symbol area is proportional to the
count in each hexagonal bin. For bins with
three or fewer counts, individual points are
represented by single-count-sized hexagons,
and are plotted at data coordinates. Thus,
modest overplotting is tolerated.


4.3 COMPARISON OF SCATTERPLOT MATRICES USING
HEXAGON SYMBOLS

Two scatterplot matrices can be compared by
juxtaposition. Lower left and upper right
triangles can contain two distinct but commonly
scaled subsets. [22] Variable order is reversed
for one data set so that no mental rotation
about forty-five degree lines is required for
comparison. However, it is still desirable to
place corresponding plots closer together. The
hexagon representation above allows this to be
done by overplotting. Suppose two data sets
are to be compared. Hexagon lattice points for
the two data sets can be made identical by
selecting a common scale and using the same bin
size. If one data set is considered the
reference data set, counts of the other can be
scaled so that the total counts are the same
for the two sets. Then the two displays can be
overplotted, one set in red, one in green and
overlap in gray. This maintains the scatterplot
matrix context and makes scanning for
differences easy. Other displays can be
considered such as direct display of functions
of non-zero and non-infinite count ratios.


4.4  LOOSE  ENDS 

A  thorough  treatment  of  density  representations 
should  include  results  concerning  smoothing 
parameters  for  density  estimation,  human 
perception  of  density  representations,  and 
should  balance  these  in  view  of  plot  production 
speed  and  device  resolution.  In  this 
presentation  only  a  few  pointers  to  the 
literature are given.


Exhibit 8: Hexagon Area Density Representation. The area of each hexagon is proportional to the
count. The largest hexagon fills its hexagonal binning region. Small counts are represented by
degenerate hexagons. Below a certain count, all counts are represented by a single square pixel.
This is an example of blunting. Regions of high density can be readily identified.


Scott [23] provides
results concerning the choice of smoothing
parameters  for  bivariate  densities  and  is  a 
portal  to  the  general  literature.  In  terms  of 
gray-scale  perception,  numerous  references  are 
available. [15, 16]  In  assessing  the  response  to 
circle  sizes,  some  studies  show  that  humans 
respond  to  area  raised  to  the  0.7  power,  but 
when  the  comparison  areas  are  in  view,  there  is 
little  reason  to  use  other  than  a  linear 
correspondence  between  the  variable  represented 
and  the  area  of  the  circle. [24]  Presumably 
this  applies  to  hexagons  also.  With  small 
hexagons,  more  is  involved  than  the  comparison 
of two areas. As dots approach 70 per inch,
as when viewed from 12 inches away, it becomes
possible to respond to gray level even though
individual dots are visible. [25] Our hexagon
centers  are  further  apart  than  this,  but  in 
regions  where  displayed  hexagons  are  separated 
by  roughly  0.01  inches  (our  raster  display 
device  resolution),  some  gray-scale  impression 
is  induced.  Higher  resolution  devices  such  as 
laser  printers  provide  opportunity  to 
capitalize  on  the  human  impression  of  gray 
scale  as  conveyed  through  area  symbols. 
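
The linear correspondence between count and symbol area used for the hexagon symbols of Section 4.2 can be written down directly. In the sketch below the function names are assumptions, and the `min_side` floor is a stand-in for device resolution: the paper shows a single pixel for such counts, whereas this example simply blunts the side length.

```python
import math


def hexagon_area(side):
    """Area of a regular hexagon with the given side length."""
    return 1.5 * math.sqrt(3.0) * side * side


def symbol_side(count, max_count, bin_side, min_side=0.0):
    """Side length giving an area proportional to count.

    The largest count just fills the bin (side == bin_side). Sides that
    would fall below min_side are blunted to min_side.
    """
    side = bin_side * math.sqrt(count / max_count)
    return max(side, min_side)


quarter = symbol_side(25, 100, bin_side=1.0)
area_ratio = hexagon_area(quarter) / hexagon_area(1.0)
tiny = symbol_side(1, 10000, bin_side=1.0, min_side=0.05)
```

Because area grows with the square of the side, a count one quarter of the maximum is drawn with half the maximum side length.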

With  the  focus  on  large  N,  the  question  arises, 
"Why  not  sample?"  The  answer  is  that  there  are 
trade-offs.  Certainly  sampling  reduces  display 
problems and provides cross validation
opportunities. However, sampling can hide
aspects  of  fine  structure  that  do  not 
necessarily  get  hidden  by  binning.  An 
advantage  of  large  N  is  that  patterns  begin  to 
emerge  in  low-density  regions  of  data.  It  is 
precisely  these  low-density  patterns  that  are 
destroyed  by  sampling.  Another  argument 
against sampling is that obtaining representative
samples when pooling over temporal and
spatial strata can require substantial work.
Thus, both small N sample plots and large N
plots  have  merits. 

5. SUMMARY

The  display  of  a  large  number  of  points  in  a 
scatterplot  matrix  has  been  a  problem.  The 
problem  manifests  itself  in  terms  of  hidden 
point  density,  long  computation  times  for 
selected  enhancement  operations,  and  slow 
displays.  The  difficulties  are  ameliorated  by 
computing  and  displaying  densities.  One 
density  representation  codes  density  as 
gray-scale.  With  real-time  graphical 
manipulation  of  a  color  look-up  table, 
attention  can  be  focused  on  different  density 
regions  and  real-time  "empirical"  contours  can 
be  obtained.  A  second  density  representation 
codes  density  as  the  size  of  hexagon  symbols 
as shown within hexagonal binning regions.
Both approaches are useful in the context of
scatterplot  matrices  and  scatterplot 
collections  where  density  exploration,  subset 
selection  and  subset  representation  are  of 
interest.  The  representation  of  a  million  or 
more  points  in  each  plot  is  feasible. 

For  subset  selection  in  the  large  N  context, 
real-time  brushing  is  not  feasible  and  polygon 
selection is the method of choice. Since
displays cannot be produced in real time, the
color  look-up  table  can  be  used  to  store 
different  subset  displays  for  real-time  review 
or  to  display  subsets  simultaneously,  with 
careful  control  of  color  mixing.  Thus,  the 
color  look-up  table  is  a  useful  tool  in  the 
context of large N.


7. APPENDIX - ALGORITHM SKETCH FOR HEXAGONAL
BINNING

Binning  algorithms  are  available  for  various 
lattices  in  dimensions  two  through  eight. [26] 
The following sketches a fast implementation of
the hexagonal binning motivated in [26].
Details like reading data, scaling data,
handling missing data, zeroing count arrays,
and checks for pathological conditions are left
to the implementor. Since the hexagons are
typically envisioned in a (square) plotted
version of data, raw data coordinates are
presumed scaled into [0,1] and resulting
vectors of length N are denoted SX and SY. The
original  minimums  and  ranges  are  given  by  XMIN, 
YMIN,  XR,  and  YR.  SIZE  is  a  user-specified 
scaling  parameter  that  indicates  roughly  the 
number  of  bins  along  the  X  axis. 


C     SIZEMAX=100
C     IMAX=SIZEMAX/SQRT(3.)+1
C     JMAX=SIZEMAX+1
      PARAMETER (IMAX=58,JMAX=101,NMAX=10000)
      DIMENSION SX(NMAX),SY(NMAX)
      INTEGER LAT1(0:IMAX,0:JMAX),LAT2(0:IMAX,0:JMAX)

      C1=SIZE/SQRT(3.)
      DO K=1,N
         X=SIZE*SX(K)
         Y=C1*SY(K)
C        Compute Two Candidate Lattice Points
C        (J,I) on the first lattice, (J2+.5,I2+.5) on the second
         J=X+.5
         I=Y+.5
         J2=X
         I2=Y
C        Select the Nearest
         IF( (X-J)**2 + 3.*(Y-I)**2 .LT.
     *       (X-J2-.5)**2 + 3.*(Y-I2-.5)**2) THEN
            LAT1(I,J)=LAT1(I,J)+1
         ELSE
            LAT2(I2,J2)=LAT2(I2,J2)+1
         ENDIF
      ENDDO

C     The lattice points in original data
C     coordinates for non-zero counts in LAT1
C     and LAT2 are obtained as follows:
C
C     CONSTANTS
C        C2=SQRT(3.)*YR/SIZE,  C3=XR/SIZE
C     LAT1(I,J)
C        Y=C2*I+YMIN,  X=C3*J+XMIN
C     LAT2(I,J)
C        Y=C2*(I+.5)+YMIN,  X=C3*(J+.5)+XMIN
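
For readers who want to experiment with the algorithm, the following is a hedged Python rendering of the Fortran sketch above. It assumes, as the appendix does, that the inputs are already scaled into [0,1]. Dictionaries of counts replace the fixed-size LAT1 and LAT2 arrays, and the names `hexbin` and `lattice_center` are illustrative, not part of the original code.

```python
import math
from collections import Counter


def hexbin(sx, sy, size):
    """Hexagonally bin points scaled into [0, 1].

    Returns two Counters keyed by (i, j): counts on the first lattice
    (integer points) and on the second lattice (offset by one half).
    """
    c1 = size / math.sqrt(3.0)
    lat1, lat2 = Counter(), Counter()
    for xs, ys in zip(sx, sy):
        x = size * xs
        y = c1 * ys
        # Two candidate lattice points; int() truncates like the Fortran
        # integer assignment because the coordinates are non-negative.
        j1, i1 = int(x + 0.5), int(y + 0.5)
        j2, i2 = int(x), int(y)
        # Select the nearest in the stretched metric dx**2 + 3*dy**2.
        d1 = (x - j1) ** 2 + 3.0 * (y - i1) ** 2
        d2 = (x - j2 - 0.5) ** 2 + 3.0 * (y - i2 - 0.5) ** 2
        if d1 < d2:
            lat1[(i1, j1)] += 1
        else:
            lat2[(i2, j2)] += 1
    return lat1, lat2


def lattice_center(i, j, which, size, xmin, xr, ymin, yr):
    """Bin center in original data coordinates; which is 1 or 2."""
    c2 = math.sqrt(3.0) * yr / size
    c3 = xr / size
    offset = 0.0 if which == 1 else 0.5
    return (c3 * (j + offset) + xmin, c2 * (i + offset) + ymin)


sx = [0.0, 0.0, 0.04, 0.31]
sy = [0.0, 0.0, 0.08, 0.0]
lat1, lat2 = hexbin(sx, sy, 10.0)
center = lattice_center(0, 0, 2, 10.0, 0.0, 1.0, 0.0, 1.0)
```

Each point costs two roundings, two truncations and one comparison, which is the small additional cost over square binning noted in Section 4.2.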


STATUS OF THE NBS GUIDE TO AVAILABLE MATHEMATICAL SOFTWARE

The Guide to Available Mathematical Software (GAMS) is a classification scheme, a data
base system, and a printed catalog. GAMS provides a framework for both the end-user
scientist and the software maintainer to handle large quantities of mathematical and
statistical software.

The extensive problem-oriented GAMS classification scheme provides a structure for
organizing software for general purpose mathematical and statistical computations. The
software currently cataloged in GAMS consists of approximately 2400 programs,
subprograms, and interactive systems in some two dozen libraries. These libraries are
available on a variety of computers. Data about the software and about library
availability are stored in a relational data base and are maintained using a variety of
software tools. Users access the data via an on-line query system based on the
classification scheme. The printed GAMS catalog organizes information about the
software according to the classification scheme and in several other useful ways.


1. INTRODUCTION

A vast body of reliable and well-designed
computer software for solving many standard
mathematical and statistical problems now
exists. This software is a crucial resource for
scientific computing through saving time and
money, expanding the scope of problems which can
be routinely solved by applied scientists, and
ensuring that the most up-to-date and reliable
numerical algorithms are used.

Collections of mathematical and statistical
software are now available in many scientific
computing centers. These collections are often
large and diverse, and thereby create several
software management problems, including
acquisition, maintenance, and documentation.

The NBS solution to these problems is the result
of the Guide to Available Mathematical Software
(GAMS) project. GAMS is joint work with Ronald
F. Boisvert and David K. Kahaner and consists of
several components. We first acquired an
extensive collection of generally available
mathematical and statistical software. We
developed a problem-oriented tree-structured
detailed classification scheme to identify the
problems the software solved. The organization
of the software on the computer facilitates its
efficient installation and maintenance. The
software documentation takes the form of an
inter-library reference. The on-line and the
off-line documentation have consistent
structure. Finally, a single data base was
developed to integrate the maintenance and
documentation functions.

The purpose of this paper is to describe in
detail the software management problems and the
solutions developed at the National Bureau of
Standards. Information about how NBS solved
these problems may prove useful to other
scientific computing centers. Other papers
describing earlier versions of this work have
focused on statistical software (Howe, 1982,
1983). The focus of this paper is not so much
the supported software but rather how that
software support is provided.

This paper begins with a brief description of
the computing environment at NBS, primarily
because our environment has influenced many of
our decisions. The features of mathematical and
statistical software from a maintainer's point
of view are then described, followed by
descriptions of the GAMS classification scheme
for mathematical and statistical software, and
both the printed and the on-line Guide to
Available Mathematical Software catalogs. The
paper concludes with an overview of our
implementations.


2. THE SCIENTIFIC COMPUTING ENVIRONMENT AT NBS

The National Bureau of Standards (NBS) is a
multi-disciplinary scientific research
laboratory with a staff of 3000. Its mission
leads to theoretical and experimental research
in the physical and engineering sciences for the
purpose of providing the measurement foundation
needed by U.S. science and industry.

Computers at NBS include a recently acquired
Cyber 180/855 and Cyber 205, and minicomputers,
microcomputers, and workstations numbering in
the hundreds. Many of these computers and
terminals are interconnected via NBSNET, an
Ethernet-like local area network.


The Center for Applied Mathematics is
responsible for providing software for use on
NBS computers, and for informing the user
community of both the availability of such
software and the information necessary to use
the software. Our efforts to date have focused
on acquiring, maintaining, and documenting
general-purpose mathematical and statistical
software which is useful in the scientific
disciplines represented at NBS. This focus
restricts our attention to approximately ten
thousand items, comprising perhaps ten percent
of the total available scientific software.
Approximately seven staff years have been
devoted to bringing the project to its current
state in which approximately 2400 user-invokable
software items are managed.

3. A SOFTWARE BASE

3.1 Software Acquisition

Widely available general-purpose scientific
software commonly takes the form of either
Fortran subprograms or Fortran programs, the
latter often with tailor-made input languages.
An early development in statistical computing
was batch-oriented Fortran programs which did
not require users to be Fortran programmers.
Current versions of these programs have
sophisticated languages specifically targeted to
statistical data analysis. The more recently
developed interactive programs have similarly
capable and sophisticated input languages.

The foundation for the development of
special-purpose batch and interactive programs has
always been Fortran subprograms, because
subprograms are the single most important source
of implementations of state-of-the-art
computational algorithms. The software provided
at a research laboratory such as NBS therefore
necessarily is an extensive collection of both
Fortran subprograms and programs to support the
needs of its users.

The software we have acquired is either
proprietary, non-proprietary, or a mixture of
the two. Leasing of proprietary software
libraries, with their extensive capabilities and
ease of installation, provides a firm foundation
for mathematical and statistical computing.

Non-proprietary software is usually produced in
university or government laboratories as a
product of research in numerical or statistical
methods. This software consists of some
narrowly-focused subprogram collections such as
LINPACK (Dongarra, et al., 1979) and a plethora
of single-purpose programs or subprograms from
many authors. There are no restrictions on
installing such software, which makes it
particularly desirable in a multi-machine
environment. Substantial effort may be
required, however, to install and test the
software on a particular machine and to provide
even the most basic support.

Recently, software with non-proprietary and
proprietary components has appeared. The
non-proprietary component commonly performs the
mathematical calculations, while the proprietary
software is a graphics software library.

Sources for software documented in GAMS include
distribution services, journals, and books.
Statistical software is announced or published
in periodicals such as The American
Statistician, journals such as Communications in
Statistics, proceedings from conferences such as
the Symposium on the Interface and the Joint
Statistical Meetings, and books (e.g., Francis
(1981)). Software may also be available from
individual authors.

3*2  Software  Organization 

Over the past several years we have collected
software from numerous sources and organized it
according to a two-level system. At the higher
level are libraries and at the lower level are
modules which are collected into libraries. A
module is the smallest user-callable
problem-solving unit, and may be an individual
user-callable subprogram, an individual batch or
interactive program, or a command in a large
interactive program.
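
The two-level organization can be pictured with a small data structure. The following Python sketch is only an illustration under stated assumptions: the class and field names are invented here, and the two MINITAB entries are sample values, not a listing of what GAMS actually records.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ModuleKind(Enum):
    """The three kinds of module named in the text."""
    SUBPROGRAM = "user-callable subprogram"
    PROGRAM = "batch or interactive program"
    COMMAND = "command in a large interactive program"


@dataclass(frozen=True)
class Module:
    """Lower level: the smallest user-callable problem-solving unit."""
    name: str
    kind: ModuleKind


@dataclass
class Library:
    """Higher level: a named collection of modules."""
    name: str
    modules: List[Module] = field(default_factory=list)

    def add(self, name: str, kind: ModuleKind) -> Module:
        module = Module(name, kind)
        self.modules.append(module)
        return module

    def count(self) -> int:
        return len(self.modules)


# Illustrative entries only; a real library holds many more modules.
minitab = Library("MINITAB")
minitab.add("REGRESS", ModuleKind.COMMAND)
minitab.add("PRINT", ModuleKind.COMMAND)
print(minitab.name, minitab.count())
```

Treating a command of an interactive program as a module in its own right is what lets a single catalog cover subprogram libraries and interactive packages alike.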

Software from a given vendor usually is a large
collection of subprograms or programs, or a
large interactive program, and is kept in a
library of its own. One special library is the
NBS Core Mathematical Library (CMLIB), a library
which is partitioned into approximately 30
sublibraries of Fortran subprograms obtained
from numerous sources. While CMLIB is
partitioned in order to maintain the individual
sublibraries, from the user's point of view it
is one very large library of portable,
non-proprietary software.

The GAMS project currently manages approximately
2400 modules in 20 libraries:


BMDP ......... 40  statistical programs
CMLIB ....... 678  mathematical and statistical
                   subprograms
DATAPAC ..... 169  statistical subprograms
IMSL ........ 471  mathematical and statistical
                   subprograms
INVAR ......... 2  interactive regression programs
LINDO ......... 1  linear programming program
MATHWARE ..... 33  mathematical and statistical
                   subprograms
MATLAB ........ 1  interactive linear algebra program
MINITAB ..... 130  commands in an interactive
                   statistics program
NAG ......... 467  mathematical and statistical
                   subprograms
PDELIB ........ 3  partial differential equations
                   subprograms
PLOD .......... 1  interactive ordinary differential
                   equations program
PORT ........ 270  mathematical and statistical
                   subprograms
ROSEPACK ...... 1  interactive robust regression
                   program
SIMSCRIPT ..... 1  simulation language
SLDGL ........ 31  ordinary differential equations
                   subprograms
SPECTRLAN ..... 1  interactive spectral analysis
                   program
STARPAC ...... 16  nonlinear regression subprograms
STATLIB ...... 56  statistical subprograms
XMPLIB ........ 2  mathematical programming
                   subprograms

For each user-callable (or executable) module we
must maintain the source code for that module
and for all non-user-callable modules which that
module references, object (either relocatable or
executable) code, on-line documentation, and
(optionally) test code and results. Where
appropriate, the on-line documentation
references printed documentation such as
manuals. All of these items now number well
over 10,000.

4. GAMS: THE GUIDE TO AVAILABLE MATHEMATICAL
SOFTWARE

End-users of scientific software are interested
in locating software to solve particular
problems, and are not interested in how the
software is organized for maintenance. The
Guide to Available Mathematical Software (GAMS)
provides such end-users with a problem-oriented
inter-library software reference. This
reference takes the form of both a printed
catalog and an on-line interactive guide. A
detailed problem-oriented classification scheme
is fundamental to each form.

4.1 The GAMS Classification Scheme

While each proprietary library documented in
GAMS has its own relatively consistent
organizational structure, none provides a
sufficiently extensive and detailed structure
for organizing the whole GAMS software
collection. We therefore have developed the
GAMS classification scheme (Boisvert, et al.,
1983) to synthesize information about the
software we support. This classification scheme
is a substantial modification of the Bolstad
scheme (1975), which in turn evolved from the
scheme adopted by the IBM users' group SHARE.

The classes at the highest level of the
classification scheme are:

A. Arithmetic, Error Analysis
B. Number Theory
C. Elementary and Special Functions
D. Linear Algebra
E. Interpolation
F. Solution of Nonlinear Equations
G. Optimization
H. Differentiation and Integration
I. Differential and Integral Equations
J. Integral Transforms
K. Approximation
L. Statistics and Probability
M. Simulation and Stochastic Modelling
N. Data Handling
O. Symbolic Computation
P. Computational Geometry
Q. Graphics
R. Service Routines
S. Software Development Tools

These classes generally proceed from fundamental
to more advanced topics. Most of these classes
are further subdivided, and in these
subdivisions core subjects appear before
specializations. Consistency has been a goal in
developing the scheme, so that, for example,
univariate problems appear before multivariate.

The development of the present classification
scheme has been strongly influenced by the
software at hand. Experience has indicated that
projections about scientific software
organization in the absence of such software
would be highly error-prone. Thus the level of
detail varies across the scheme. By assigning
at most about a dozen modules to any class, the
scheme also reflects a compromise between
accuracy and quantity: it would be tedious
either to search a detailed subtree that holds
only a few modules or to find one useful module
among many in a single class.

Interrelationships among classes motivated the
inclusion of cross-references in the scheme.
Thus, for example, class L3, containing software
for probability function evaluation,
cross-references class C (elementary and special
functions). A module which performs several
tasks may be assigned to multiple classes; an
example is spline approximation. Some
user-callable subprograms are almost always used
in pairs; for reasons of efficiency, when
documentation for one references the other, then
only one is classified.

Each module is classified in the lowest
appropriate class. When a module performs
tasks in several subclasses of a particular
class, however, it is classified at a higher
level. This is especially common with
statistical software and large interactive
programs.
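
The placement rule (lowest appropriate class, but a higher level when a module spans several subclasses of one class) amounts to taking a common prefix in the classification tree. The Python sketch below is a hypothetical illustration: class codes are written as tuples of levels, and the sub-subclass labels "a" and "b" are invented for the example rather than taken from the GAMS scheme.

```python
def common_prefix(paths):
    """Longest shared leading portion of several class paths."""
    prefix = []
    for parts in zip(*paths):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def assign_classes(task_classes):
    """Return the class(es) under which a module would be catalogued.

    Tasks inside one top-level class collapse to their common ancestor;
    tasks in different top-level classes yield multiple classes.
    """
    by_top = {}
    for path in task_classes:
        by_top.setdefault(path[0], []).append(path)
    return sorted(common_prefix(paths) for paths in by_top.values())


# One task: the module sits at the lowest appropriate class.
print(assign_classes([("L", "3", "a")]))
# Two subclasses of L3: the module moves up to L3.
print(assign_classes([("L", "3", "a"), ("L", "3", "b")]))
# Tasks in unrelated top-level classes: the module is listed under both.
print(assign_classes([("L", "3"), ("C", "7")]))
```

Grouping by top-level class first keeps a multi-purpose module from being pushed all the way up to the root, which matches the statement that such modules may be assigned to multiple classes.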

As existing scientific software is added to
GAMS, and as new software becomes available, the
classification scheme will undergo selective
revision. Given its tree structure, however,
the scheme itself ought not to undergo radical
revision in the near future.

4.2 The Printed GAMS Catalog

From the user's point of view, GAMS manifests
itself as a printed catalog and an interactive
consultant. The printed catalog is required by
those who do not use the computer on which the
interactive GAMS resides. These users may well
include people not at NBS. The most recent GAMS
catalog was released as a 448-page NBS Technical
Report (Boisvert, et al., 1984). While we do
not distribute the software documented in GAMS,
the catalog contains the addresses of the
sources which distribute the software. The GAMS
catalog is available from the author or NTIS.

In order to satisfy the needs of different
users, the catalog is organized in five
sections:

A. GAMS Classification Scheme
B. Modules by Class
C. Module Dictionary
D. Library Reference
E. Index

Modules by Class catalogs the software according
to the classification scheme. Under each class
in the scheme is a list of modules, including a
brief description of each module and the library
to which it belongs. For higher level classes
there may also appear discussions of the types
of software found in those classes, along with
issues and problems a user should address when
selecting software, and references.

The alphabetically organized Module Dictionary
contains detailed information about each module
in GAMS, including:

* brief description;
* type (e.g., subprogram, batch program);
* proprietary or non-proprietary;
* library (and sublibrary, if appropriate)
  membership;
* precision (single or double);
* GAMS class(es);
* usage syntax (e.g., call sequence, command
  syntax);
* location of on-line documentation on an NBS
  computer;
* location of source on an NBS computer (if not
  proprietary);
* (optional) location of test programs on an NBS
  computer;
* (optional) location of sample programs on an
  NBS computer;
* commands required to access the module on an
  NBS computer; and
* (optional) names of other modules used with
  this module.

The contents of each library are summarized in
the Library Reference. First, the following
general information is given:

* brief description;
* version;
* type (e.g., subprogram library);
* language;
* portability information;
* references; and
* developer and/or distributor.

For each machine on which the library is
supported, the following information is given:

* version;
* level of in-house support and a contact
  person;
* how to obtain on-line documentation; and
* how to access the library.

Similar information is given about each
sublibrary in the partitioned library CMLIB.

The index alphabetically organizes keywords and
phrases with pointers into the classification
scheme.

4.3 The GAMS Interactive Consultant

While the specific details of the on-line
version of GAMS are not of interest at sites
where it is not available, the general features
may be of interest at sites where a similar
capability is desirable. The main reason for
developing the interactive consultant is
timeliness. Whereas the GAMS catalog is printed
infrequently, the on-line data are updated
regularly, and hence the interactive consultant
provides current information.

A user of the interactive consultant may
traverse the classification tree. When a node
of the tree is visited, a count of the number of
modules classified there and a list of the
descendant classes are obtained. A user may
then obtain information about each module
(similar to that provided by the Module
Dictionary) at that node or may move to another
node. The consultant is made easy to use by
having a simple command syntax and internal help
facilities.

Users may constrain their software search by
restricting attention to portable software, to
software in a particular library, or to software
which computes in a particular precision.
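
The behaviour described above (visit a node, see a module count and the descendant classes, optionally narrow the search) can be sketched in a few lines of Python. This is an assumption-laden illustration, not the consultant's actual design: the `Node` and `ModuleInfo` types, the `visit` function, and the module and library names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModuleInfo:
    name: str
    library: str
    portable: bool
    precision: str  # "single" or "double"


@dataclass
class Node:
    code: str
    title: str
    modules: List[ModuleInfo] = field(default_factory=list)
    children: Dict[str, "Node"] = field(default_factory=dict)


def visit(node: Node, library: Optional[str] = None,
          portable_only: bool = False, precision: Optional[str] = None):
    """Return (module count, descendant class codes, matching modules)."""
    def keep(m: ModuleInfo) -> bool:
        return ((library is None or m.library == library)
                and (not portable_only or m.portable)
                and (precision is None or m.precision == precision))

    matches = [m for m in node.modules if keep(m)]
    return len(matches), sorted(node.children), matches


# Hypothetical contents for two nodes of the tree.
l3 = Node("L3", "Probability function evaluation", modules=[
    ModuleInfo("PROBA", "LIBA", portable=True, precision="single"),
    ModuleInfo("PROBB", "LIBB", portable=False, precision="double"),
])
root = Node("L", "Statistics and Probability", children={"L3": l3})

print(visit(root)[:2])                      # no modules here, one subclass
print(visit(l3, portable_only=True)[0])     # portable modules at L3
print(visit(l3, precision="double")[0])     # double-precision modules at L3
```

Applying the constraints at each visit, rather than pruning the tree in advance, keeps the traversal commands identical whether or not a restriction is in force.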

Once a user has identified software of interest,
detailed on-line instructions on how to use that
software are then available.

5. IMPLEMENTATION

The two fundamental components of our software
support are the maintenance of the software
itself, and the maintenance of information about
the software. We have developed software tools
to efficiently manage our large software
collection.

5.1 Naming Conventions

Developing naming conventions is the first step
in automating software management. Such
conventions must necessarily conform to the file
structure of the computer on which the software
is maintained, and we have developed conventions
for several computers. For the purpose of this
paper, however, our naming conventions are
illustrated using UNIX path names.


A module's source code is in the path

Library-name/SOURCE/Sublibrary-name/Module-name.

Relocatable (object) code is in the path

NBS/Library-name/Sublibrary-name/Module-name.

In order to access a module a user need only
know the library in which it resides.

A similar format is used for the location of the
documentation for an individual module, a
sublibrary, and a library, respectively, as
follows:

Library-name/DOC/Sublibrary-name/Module-name

Library-name/DOC/Sublibrary-name/SUMMARY

Library-name/DOC/SUMMARY.

Naming conventions for test software, test
results, reference materials, and other
information are more variable.
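
Because the conventions are regular, a management tool can build every location from the library, sublibrary, and module names alone. The Python sketch below illustrates that idea for the UNIX-style paths shown above; the function names are invented here, the leading NBS component is reproduced as printed, and the sample library, sublibrary, and module names are placeholders.

```python
import posixpath
from typing import Optional


def source_path(library: str, sublibrary: str, module: str) -> str:
    """Library-name/SOURCE/Sublibrary-name/Module-name"""
    return posixpath.join(library, "SOURCE", sublibrary, module)


def object_path(library: str, sublibrary: str, module: str) -> str:
    """NBS/Library-name/Sublibrary-name/Module-name (as printed above)."""
    return posixpath.join("NBS", library, sublibrary, module)


def doc_path(library: str, sublibrary: Optional[str] = None,
             module: Optional[str] = None) -> str:
    """Documentation for a module, a sublibrary, or a whole library."""
    if module is not None and sublibrary is None:
        raise ValueError("a module path needs its sublibrary")
    parts = [library, "DOC"]
    if sublibrary is None:
        parts.append("SUMMARY")
    elif module is None:
        parts += [sublibrary, "SUMMARY"]
    else:
        parts += [sublibrary, module]
    return posixpath.join(*parts)


# Placeholder names, used only to show the resulting shapes.
print(source_path("LIB", "SUBLIB", "MOD"))   # LIB/SOURCE/SUBLIB/MOD
print(object_path("LIB", "SUBLIB", "MOD"))   # NBS/LIB/SUBLIB/MOD
print(doc_path("LIB", "SUBLIB", "MOD"))      # LIB/DOC/SUBLIB/MOD
print(doc_path("LIB", "SUBLIB"))             # LIB/DOC/SUBLIB/SUMMARY
print(doc_path("LIB"))                       # LIB/DOC/SUMMARY
```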

5.2 Software Management Procedures

Three categories of frequently-occurring
software maintenance activities have warranted
the development of software tools. The first is
software installation, and tools have been
written for Fortran source dispersion,
documentation extraction, and Fortran library or
sublibrary compilation. The second category is
documentation retrieval, for which there exist
tools to extract module, sublibrary, and library
documentation. Finally, tools have been written
to prepare individual Fortran subprograms (and
externals), Fortran sublibraries, and CMLIB for
redistribution.

5.3 The GAMS Data Base

A single relational data base was chosen to
support both software maintenance and
documentation functions. This type of data base
was chosen for its simplicity and flexibility.
It is used to access the data in several ways
(e.g., to organize modules either according to
the classification scheme or alphabetically for
documentation purposes and to organize the
modules by library for maintenance purposes) and
thus is used to develop the printed catalog, to
drive the interactive consultant, and for
software maintenance.

A relational data base is a collection of tables
called relations. The five relations in the
GAMS data base are library, sublibrary, module,
node (containing the GAMS classification
scheme), and tree (containing pointers which
describe the tree structure of the
classification scheme and identify the modules
classified at each node).

Each relation is a matrix in which rows are
cases and columns are attributes. The
attributes in the library relation are those
itemized for the Library Reference (see section
4.2). The attributes in the sublibrary and
module relations are similar.
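
To make the five-relation layout concrete, here is a minimal Python sketch that uses the standard `sqlite3` module as a stand-in for the data base system actually used. Everything beyond the five relation names is an assumption: the columns are a small invented subset of the attributes, and the rows are placeholder data. It shows the same tables answering both a documentation-style query (modules at a class) and a maintenance-style query (modules in a library).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE library    (name TEXT PRIMARY KEY, description TEXT);
CREATE TABLE sublibrary (library TEXT, name TEXT, description TEXT);
CREATE TABLE module     (library TEXT, name TEXT, description TEXT);
CREATE TABLE node       (code TEXT PRIMARY KEY, title TEXT);
-- Each tree row points from a node to a child node or to a module.
CREATE TABLE tree       (node TEXT, child TEXT, module TEXT);
""")

# Placeholder rows; names other than the class codes are hypothetical.
con.executemany("INSERT INTO library VALUES (?, ?)",
                [("LIBA", "example library A"), ("LIBB", "example library B")])
con.executemany("INSERT INTO module VALUES (?, ?, ?)",
                [("LIBA", "PROBA", "probability function"),
                 ("LIBB", "PROBB", "probability function"),
                 ("LIBA", "SOLVE1", "linear system solver")])
con.executemany("INSERT INTO node VALUES (?, ?)",
                [("L", "Statistics and Probability"),
                 ("L3", "Probability function evaluation")])
con.executemany("INSERT INTO tree VALUES (?, ?, ?)",
                [("L", "L3", None),
                 ("L3", None, "PROBA"),
                 ("L3", None, "PROBB")])


def modules_at(code):
    """Documentation view: modules classified at one node of the scheme."""
    rows = con.execute(
        "SELECT m.name, m.library FROM tree t "
        "JOIN module m ON m.name = t.module "
        "WHERE t.node = ? ORDER BY m.name", (code,))
    return rows.fetchall()


def modules_in(library):
    """Maintenance view: modules belonging to one library."""
    rows = con.execute(
        "SELECT name FROM module WHERE library = ? ORDER BY name", (library,))
    return [r[0] for r in rows]


print(modules_at("L3"))
print(modules_in("LIBA"))
```

Keeping the classification tree in its own relation means the scheme can be revised without touching the library, sublibrary, or module rows.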

The GAMS data base was constructed using RIM
(1982), a relational information management
system developed by the Boeing Commercial
Airplane Company under contract to NASA. RIM
provides both an interactive query system and a
Fortran interface. The interactive query system
is used to monitor the contents of the data
base. The RIM applications program interface is
a set of Fortran subprograms which may be used
to load data into the data base or to retrieve
data from it. It has been used to develop the
specialized programs which access the GAMS data
for the interactive consultant, for the
production of the printed GAMS catalog, and for
data base maintenance.

6.  DISCUSSION 

The central purpose of the GAMS project has been
to provide an integrated system of documented
software. The integration has been across many
features, not the least of which is the
consolidation of mathematical and statistical
software under one umbrella. This consolidation
facilitates the communication among numerical
analysts, statisticians, and other scientists
involved in both software development and usage.
While the target audience for GAMS products has
been NBS scientists, the work has had broader
interest. Of particular interest to the
statistical computing community, the Committee
on Statistical Algorithms of the Statistical
Computing Section of the American Statistical
Association has been involved with the
development of that portion of the
classification scheme dealing with statistical
computations. Current issues under discussion
by the committee include providing to the
general statistical computing community a more
general version of GAMS. This version would on
the one hand not contain site-specific
information, and on the other would reference
software not available on a particular computer
(e.g., listings in journals). Of course, it
would be desirable to have much of this software
provided through a distribution service.
Substantial effort may be required to modify the
software for portability, prepare on-line
documentation, and prepare test software
designed to efficiently test whether or not the
software has been properly installed.

NBS is currently undertaking the task of
converting all of its central computing to the
Cyber 855 and Cyber 205. The GAMS data base
will be maintained on a VAX 11/785. As part of
these conversions, the GAMS software has become
more flexible and more portable. The presence
of these large computers has also motivated
software consolidation, and as a result, more
software will be supported and documented
through GAMS. Current efforts involve adding
SPSS, Dataplot, and graphics software to the
GAMS data base. Future plans include further
modifying the data base to fully distinguish
among machines in a multi-machine (computers
and/or peripherals, especially graphics devices)
environment, and providing software specifically
for vector computers and computers with other
interesting architectures.
In the future widely-used statistical software will incorporate many
more user-friendly features including some topics usually considered as
artificial intelligence. We are currently working on three projects to
prepare our students for their future work with such software and to
improve our teaching of data analysis. We are monitoring and
evaluating user-response patterns. The results of this study will be
some suggested enhancements for interactive statistical software and a
better understanding of how students analyze data. Other work includes
the creation of a preprocessor to assist users in deciding the
appropriate commands for their analyses and a report generator designed
specifically for introductory applied statistical topics.


Motivation 

In contrast to a decade ago, today most
colleges and universities use
statistical software in their
introductory applied statistics courses.
In the future the use of such software
will be even more prevalent. At our
small teaching-oriented college of 1,400
full-time undergraduates and 1,600
full-time and part-time MBAs, we are no
exception. In each of our introductory
statistics courses, the students are
exposed to at least one statistical
package. In the past the most commonly
taught package was IDA; currently all
our students use Minitab. Among the
reasons for using such software are the
ability to perform calculations more
easily and to produce more statistical
analyses, including complicated
analyses, state-of-the-art analyses, and
analyses involving complex data
structures. It is often interesting to
reflect that just a short time ago there
were entire courses devoted to multiple
linear regression which contained much
the same material now taught in
approximately one week.

Too few instructors who use such
software recognize that there are
weaknesses associated with its use. For
example, at many schools there is only
one package available or only one
package taught so that the package
dictates the material covered in the
course. Many students in these
single-package courses do not recognize
that there are other analyses which are
not available on the chosen package.

Thus  the  students  use  only  that  package 
in  their  subsequent  statistical  analysis 
work.  New  problems  such  as  simultaneous 


number  of  introductory  applied 
statistics  courses,  only  a  cookbook 
approach  to  statistical  analysis  with 
the  computer  is  performed.  That  is,  the 
students  are  taught  that  entering  the 
given  sequence  of  commands  is  the  only 
way  to  handle  a  given  problem.  This 
lack  of  teaching  the  strategy  of 
analysis  is  a  major  weakness  in  the  use 
of  statistical  software  in  the  classroom 
today. 

The strengths and weaknesses of using
statistical software in the classroom
will be more pronounced in the future.
Statistical packages will become even
more powerful and user-friendly.
Statistical package developers
were slow to realize that users
appreciated such features, as is shown by
the following quote from Francis (1981,
p. 23): "The notable feature of this
table is the group entitled Convenience
of User Language. Expressions such as
'ease of use', 'convenience' or
'language' were repeated time and
again."

But today, as Kay (1984, p. 54) mentions,
"The user interface was once the last
part of a system to be designed. Now it
is the first." This "softer software",
as Gates calls it, will probably
incorporate some of the features now
associated with artificial intelligence.
(Two excellent summary articles on the
interface between artificial
intelligence and statistics are Gale and
Pregibon (1985) and Haim et al. (1985).)

One characteristic of future statistical
software will be a better system for
guiding the user toward the analyses that
should be performed. Thus the user will be
directed to the command or series of
commands that should be used in order to
analyze a particular set of data.

Another  future  characteristic  will  be 
domain-generated  output  in  the  form  of  a 
readable  report.  The  end-user  will  not 
have  to  translate  the  package's  output. 
All  he  or  she  will  have  to  do  is  read 
the  package's  executive  summary  of  the 
analysis.  The  incorporation  of  both  of 
these  features  into  future  statistical 
software  will  affect  greatly  the  way 
applied  statistics  is  taught.  In  order 
to  prepare  ourselves  better  for  our 
future  teaching  we  are  creating  both  an 
elementary  preprocessor  for  Minitab  to 
assist  the  students  in  selecting  the 
proper  commands  and  report  generators 
for  some  topics  associated  with  our 
introductory  applied  statistics  course. 
Descriptions  of  our  approaches  to  these 
two  projects  may  be  found  at  the  end  of 
this  paper.  In  order  for  us  to  be 
successful  we  believe  more  information 
is  needed  about  how  students  analyze 
data  using  a  statistical  package. 


Knowledge about how either students or
the average user employ a statistical
package is quite sparse. Much work in
artificial intelligence has centered
around how experts feel average
users will employ such packages. Then
this expert information is incorporated
into the system. While we believe it is
extremely important that the experts'
opinions be placed into future
statistical software, we also believe
that what the average user does with
such software should also be
incorporated. Thus our initial work in
preparing for the arrival of the
statistical software of the future is
the monitoring and evaluation of
user-response patterns to existing
statistical software. From this work we
feel that we will gain a better
understanding of how our students use a
statistical package and thereby obtain
a solid background for beginning our
work on the preprocessor and report
generators mentioned above. (An
immediate byproduct will be an
improvement in our teaching of such
software.) Moreover we will be able to
provide some suggested enhancements to
the existing statistical software.

Monitoring and Evaluating User-Response
Patterns

Our initial work in monitoring the
user-response patterns on a data analysis
problem was performed on a small group
of students taking a second course in
applied statistics. Although we
designed this work to set the foundation
for our later experiments, we did gain
some insight into how users employ a
statistical package. In addition a
discussion of this experiment provides
the basis for our later work.

This experiment was given as the
take-home portion of the final
examination. Here are the instructions
given to the students:

1. Examine supplementary exercise 14.47
on pages 669-670 of McClave and Benson
(1982).

2. Outline the Minitab commands
necessary to complete this exercise.

3. Perform these commands in one (1)
run of the statistical package NEWMINI
on a hard copy terminal. (To get into
NEWMINI, enter NEWMINI at the S prompt.)

4. Bring your copy of this run,
including log-in and log-off
information, to the final examination.

5. At the final be prepared to answer
questions related to this exercise which
may be based upon your NEWMINI run.

Exercise 14.47 is a problem which
illustrates that the two-independent-sample
mean problem with unknown
variances can be analyzed in three ways:
by a pooled two-sample t-test, a
slope test in a simple linear regression
model, and a one-way ANOVA test. This
exercise from the course's principal
text was selected so that the students
would use a variety of the NEWMINI
commands. The data set for this initial
experiment was presented as two columns
of eight observations. Thus it could be
entered into the machine with no
difficulty.
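
The equivalence of the three analyses can be checked numerically. The Python sketch below uses made-up data (two samples of eight values, not the data of exercise 14.47) and computes the pooled two-sample t statistic, the t statistic for the slope when the response is regressed on a 0/1 sample indicator, and the one-way ANOVA F statistic. The two t statistics coincide and F equals t squared.

```python
import math
from statistics import mean

# Made-up observations for illustration; not the textbook data.
a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.3, 12.0]
b = [13.2, 12.9, 14.1, 13.6, 12.8, 13.9, 14.4, 13.1]


def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    sp2 = ss / (na + nb - 2)
    return (mb - ma) / math.sqrt(sp2 * (1 / na + 1 / nb))


def slope_t(a, b):
    """t statistic for the slope of y on a 0/1 sample indicator."""
    x = [0.0] * len(a) + [1.0] * len(b)
    y = a + b
    n = len(y)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope / math.sqrt(sse / (n - 2) / sxx)


def anova_f(a, b):
    """One-way ANOVA F statistic for two groups."""
    groups = [a, b]
    values = a + b
    grand = mean(values)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (between / (len(groups) - 1)) / (within / (len(values) - len(groups)))


t1, t2, f = pooled_t(a, b), slope_t(a, b), anova_f(a, b)
print(math.isclose(t1, t2), math.isclose(f, t1 ** 2))
```

All three tests share the same error sum of squares with 14 degrees of freedom here, which is why their conclusions must agree.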

NEWMINI was just a modified version of
Minitab provided to us by Minitab, Inc.
When someone uses NEWMINI a record of
their command entries is placed into a
file. This file also includes a listing
of the Minitab-recognized errors
encountered along with numerical codes
explaining these errors. NEWMINI does
not keep track of typographical errors
which were corrected before being sent
to the main package. Nor does it
identify fatal errors. The students
were unaware their responses were being
monitored.

Note that for this first experiment we
asked the students to perform only one
run. This unrealistic requirement for
modern data analysis in an interactive
mode was introduced because our main
goal was to test how well NEWMINI
worked. That is also the reason we
required the students to turn in hard
copy runs of their work on NEWMINI. We
requested log-in and log-off information
for this run in order to guarantee that
each student performed the assignment on
his or her own account.

At the examination we collected the
students' hard copies. Then we compared
these runs with the files created on
each student's account by NEWMINI. At
this point we discovered that we did not
have the expected one-to-one
correspondence. Somehow a number of
NEWMINI files had disappeared. This was
probably due to a file saving snafu by
our computing center at the end of the
semester. Based upon this information
we decided to automate further the
collection of files generated by NEWMINI
for our future experiments.

We  wrote  a  program  to  process  the  data 
in  the  18  NEWMINI  files  we  had  obtained. 
This  program  produced  the  following  15 
pieces  of  information:  student  section 
number,  student  ID  number,  session 
number,  type  of  entered  line,  command  ID 
number,  command  string,  number  of 
characters  in  a  line,  number  of  entries 
separated  by  blanks  in  the  line,  whether 
or  not  there  were  subcommands  or  error 
messages  following  the  line,  the  number 
of  Minitab  recognized  errors  along  with 
their  error  codes,  the  arguments 
following  HELP  and  SAVE  commands,  and 
the  number  of  data  lines  following  the 
given  line. 
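
A few of the per-line items listed above can be derived with very little code. The Python sketch below is hypothetical throughout: the NEWMINI file format is not described here, so the input is taken to be one entered line per string, the command set is a small illustrative subset, and the rules for recognizing data and greeting lines are assumptions made for the example.

```python
# Illustrative subset of command names; the real list is much longer.
KNOWN_COMMANDS = {"READ", "SET", "NAME", "PRINT", "TWOSAMPLE",
                  "REGRESS", "ONEWAY", "HELP", "SAVE", "STOP", "END"}


def is_number(token: str) -> bool:
    try:
        float(token)
        return True
    except ValueError:
        return False


def describe(line: str) -> dict:
    """Return the line type, character count, and blank-separated entry count."""
    tokens = line.split()
    if not tokens:
        kind = "other"
    elif all(is_number(t) for t in tokens):
        kind = "data"                       # assumed: data lines are all numeric
    elif tokens[0].upper().startswith("MINITAB"):
        kind = "greeting"                   # assumed form of the greeting line
    elif tokens[0].upper() in KNOWN_COMMANDS:
        kind = "command"
    else:
        kind = "other"                      # usually an error
    return {"kind": kind, "n_chars": len(line), "n_entries": len(tokens)}


for entered in ["SET C1", "12.1 11.4 13.0", "SETT C1"]:
    print(entered, describe(entered))
```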

For this initial experiment section and
session numbers were constant. Entered
lines were classified as to whether they
were a valid Minitab command, a data
line, a Minitab greeting line, or
something else (usually an error). This
information was entered into Minitab for
analysis. In this process we created
additional variables from the entered
information. For example, we created a
variable which identified the command
classification of each valid command
line. This classification scheme was
based upon the 20 command categories
present in the Minitab documentation. A
preliminary summary of this first
experiment is presented under the CA0
(Computer Assignment 0) column of Table
1.

Table 1

Preliminary Summary
of First Two Experiments

Number of:                        CA0      CA1

sections                            1        4
students                           18      150
runs                               18      572
command lines                     291    12259
nondata lines                     252    11750
valid command lines               228     9787
invalid command lines               7     1356
greeting lines                     17      607
data lines                         39      509
error lines                        14     1790
1 common error lines               14     1496
2 common error lines                0      242
3 common error lines                0       11
4 common error lines                0        2
5 common error lines                0        0
individual command error lines      0       39

From this table we see that 18 students
entered 291 lines in performing their 18
runs. Most of these lines were nondata
lines. Of these 252 lines, 228 were
valid command lines, 7 were invalid
command lines, and 17 were greeting
lines. (The reason there were only 17
greeting lines is that one student
emptied the contents of the NEWMINI file in
order to free up some file space. This
also led to our above-mentioned decision
to automate the data collection process
in further analyses.) From Table 1 we
see that the 14 errors recognized by
Minitab were relatively simple errors in
that only one error appeared on each
error line. Thirty-five percent of
these errors were errors of improper
command name designation.
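
The counts in Table 1 appear to be additive, and a short Python check makes the relationships explicit. The reading of which rows sum to which totals is an inference from the numbers themselves, not something stated in the text.

```python
# Counts copied from Table 1 as (CA0, CA1).
table1 = {
    "command lines": (291, 12259),
    "nondata lines": (252, 11750),
    "valid command lines": (228, 9787),
    "invalid command lines": (7, 1356),
    "greeting lines": (17, 607),
    "data lines": (39, 509),
    "error lines": (14, 1790),
    "1 common error lines": (14, 1496),
    "2 common error lines": (0, 242),
    "3 common error lines": (0, 11),
    "4 common error lines": (0, 2),
    "5 common error lines": (0, 0),
    "individual command error lines": (0, 39),
}


def total(rows, column):
    return sum(table1[name][column] for name in rows)


checks = {}
for column, label in enumerate(("CA0", "CA1")):
    checks[label] = (
        # every entered line is either a nondata line or a data line
        table1["command lines"][column]
        == total(["nondata lines", "data lines"], column),
        # nondata lines split into valid, invalid, and greeting lines
        table1["nondata lines"][column]
        == total(["valid command lines", "invalid command lines",
                  "greeting lines"], column),
        # error lines split by the number and kind of errors on the line
        table1["error lines"][column]
        == total(["1 common error lines", "2 common error lines",
                  "3 common error lines", "4 common error lines",
                  "5 common error lines",
                  "individual command error lines"], column),
    )
    print(label, checks[label])
```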

As mentioned above, we had originally
intended this experiment as a way to
pretest our information gathering
process. But to our surprise, we
gathered some insights into both
potential software improvements and a
better understanding of the data
analysis process. We noticed that a
large number of these students attempted
to use Minitab's SET command to enter
two variables at a time even though SET
is designed to enter one variable at a
time. For example, SET C1 is valid, but
it is not valid to say SET C1, C2. In
addition few students took advantage of
the horizontal data entry feature
available in SET. We also observed that
a lot of these relatively advanced
students just restarted after they made
a data entry error. This may have been
due to the small data set but it pointed
out to us the potential wastefulness of
repeated data entry. We now believe
that we should demonstrate more examples
of data correction than we currently do.


Our  overall  observation  was  that  these 
students  had  mastered  data  analysis 
using  an  interactive  statistical  package 
fairly  well.  We  saw  this  by  such  things 
as the paucity of errors and the fact
that the two requested HELP commands were
followed  by  the  command  on  which  HELP 
was  requested.  This  impression  can  also 
be  seen  in  Table  2  and  Table  3. 

Table 2

CA0 Expert Data Analysis Flow Chart


Table 3

CA0 Modal Data Analysis Flow Chart


Table 2 is a data analysis flow chart
that we constructed. It represents our
opinion of the sequence of Minitab
commands an expert would have entered
in order to answer exercise 14.47 in one
Minitab run. This sequence shows that
the student should first enter the two
variables, one containing the data and
the other indicating the appropriate
sample, label these variables, and then
print out the labeled variables for
verification. Then the student is ready
to perform the requested three analyses.
In our opinion this is accomplished by
performing a two-sample t test with an
indicator variable (TWOT), plotting the
data, performing a regression after
requesting a complete set of output
(BRIEF), and performing a one-way ANOVA
with an indicator variable (ONEWAY).
Finally the student should leave Minitab
by STOPping his or her session.

Table  3  is  a  data  analysis  flow  chart 
constructed  from  the  students'  modal 
sequence of Minitab commands. In
contrast  to  Table  2  this  table  also 
includes  the  number  of  students  who 
followed  a  specific  path.  For  example, 
eight  students  started  with  a  READ 
command  while  eight  students  began  their 
run  with  a  SET  command.  After  using 
these  commands  20  students  ended  their 
data  entry  by  issuing  an  END  command. 

The  modal  next  command  was  PRINT.  That 
is,  most  of  the  students  printed  out 
their  data  entries  for  verification. 

From  the  PRINT  command  the  modal  command 
responses  were  TWOSAMPLE  (a  two 
independent  sample  command  involving  two 
columns  of  data)  and  REGRESSION.  This 
split  probably  occurred  because  a  number 
of  students  did  not  perform  any 
regression  analysis  due  to  the  fact  that 
exercise  14.47  included  regression 
output  from  SAS.  The  modal  response 
after  TWOSAMPLE  was  a  JOIN  command.  This 
command  was  necessary  since  the  data 
needed  to  be  restructured  in  order  to 
perform  a  REGRESSION.  Thus  it  was 
followed  by  a  SET  command.  After 
REGRESSION the modal response was
AOVONEWAY, an ANOVA command involving
two columns of data. This command was
then followed by a Minitab STOP command
in the modal flow chart. The students'
modal responses covered all four parts
of this exercise: data entry, two-sample
t test, regression, and one-way ANOVA.
While  their  response  patterns  were  not 
identical  to  the  expert's  pattern,  their 
patterns  were  reasonably  close. 
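
The tallying behind a modal flow chart such as Table 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each logged run is taken to be an ordered list of command names, the three runs shown are invented rather than taken from the recorded data, and the function names `transition_counts` and `modal_path` are hypothetical.

```python
from collections import Counter, defaultdict


def transition_counts(sessions):
    """Count, for every command, which command followed it and how often."""
    counts = defaultdict(Counter)
    for commands in sessions:
        for current, following in zip(commands, commands[1:]):
            counts[current][following] += 1
    return counts


def modal_path(sessions, start, max_steps=20):
    """Follow the most frequent next command from `start` until a dead end,
    a repeated command, or `max_steps` commands."""
    counts = transition_counts(sessions)
    path = [start]
    while len(path) < max_steps and counts[path[-1]]:
        following, _ = counts[path[-1]].most_common(1)[0]
        repeated = following in path
        path.append(following)
        if repeated:  # stop rather than cycle forever
            break
    return path


# Invented runs, for illustration only (not the recorded student data).
sessions = [
    ["SET", "END", "PRINT", "TWOSAMPLE", "STOP"],
    ["READ", "END", "PRINT", "REGRESSION", "AOVONEWAY", "STOP"],
    ["SET", "END", "PRINT", "TWOSAMPLE", "STOP"],
]
print(modal_path(sessions, "SET"))
```

The counts attached to each transition correspond to the numbers of students shown along the paths of the flow chart.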

The  students  for  our  next  experiment 
were  members  of  four  sections  of  an 
introductory  applied  statistics  course. 
Most  of  these  students  were  freshmen  who 
had  never  dealt  with  a  statistical 
package before. As for computer
expertise, all of these students had
either taken or were taking an
introductory management information
course in which they were required to
write BASIC programs on Babson's VAX.
Thus,  while  all  of  these  students  were 
Minitab  novices,  all  had  some 
familiarity  with  the  environment  in 
which  Minitab  resided. 

This  experiment  was  given  as  their  first 
computer  assignment.  For  this  project 
they  were  asked  to  describe  the  data 
present  in  Table  4. 

Table  4 

State  and  Local  Per  Capita  Tax  Burden 
in  Fiscal  1982-1983 


We  were  interested  in  monitoring  how 
these  students,  with  less  than  two  weeks 
of  limited  Minitab  classroom  exposure, 
would  enter  and  manipulate  the  data  in 
Minitab  from  the  given  map.  In  addition 
we  were  interested  in  learning  how  these 
students  would  describe  the  data  and 
display  the  data  graphically  using  the 
Minitab  commands. 

To  perform  this  experiment  the  students 
were  asked  to  access  NEWMINI  through  a 
program called CA1 on a hardcopy
terminal,  not  a  CRT.  In  that  way  we 
obtained  hard  copies  of  their  runs  to 
check  against  the  monitoring  we  did 
using  NEWMINI. 

The  second  column  in  Table  1  presents 
our  initial  summary  of  this  experiment. 
Note  that  we  collected  a  lot  more  data 
in  this  experiment.  This  increase  was 
not  just  due  to  the  larger  number  of 
students.  It  was  also  due  to  the  nature 
of the problem (which was much more
open-ended than CA0), the nature of the
assignment (a one-week assignment, in
contrast to a one-run take-home portion
of a final examination), and the fact
that this experiment involved mainly
freshmen. Table 1 also shows that the
students working on CA1 entered many
more error lines, something to be
expected from this group of naive users.

A  number  of  other  interesting 
observations  come  forth  from  our  initial 
examination  of  the  data  from  this 
experiment.  For  instance,  the  number  of 
runs  ranged  from  1,  by  24  students,  to 
12,  by  2  students.  The  median  number  of 
runs  was  3.  The  median  number  of 
continuous  data  lines  entered  by  the 
students  was  1  due  to  the  large  number 
of  students  who  used  the  SET  command  to 
enter  the  51  pieces  of  data  from  the 
map.  The  two  most  frequently  used 
commands  were  PRINT  and  HISTOGRAM.  These 
commands  were  entered  9.68%  and  9.07%  of 
the  time,  respectively.  Based  upon  the 
experiment  these  responses  were  no 
surprise. But it was surprising that 33
different users entered the
KRUSKAL-WALLIS command a total of 37
times for this assignment
dealing with descriptive statistics.
Another  expected  response  was  that  the 
students  requested  HELP  on  the  HISTOGRAM 
most  frequently.  The  most  common  error 
was  the  entry  of  an  illegal  command 
which  constituted  23.70%  of  the  errors. 

From  observing  items  such  as  the  above 
from  the  572  runs  of  the  150  students 
who  participated  in  this  experiment,  we 
determined  six  places  where  the  Minitab 
statistical  package  might  be  improved. 
Most  of  these  improvements  deal  with 
increased  user-friendliness  although  our 
first  suggestion  might  be  viewed  as 
making  the  package  less  friendly.  To 
our surprise a large number of students
were entering lines without any
delimiting blanks, for example, DESCC1
instead of DESC C1. While use of this
entry method did not cause the user any
trouble initially, it led to great
difficulty when complex commands
appeared. For example, entering
HISTC1650 250 for HIST C1 650 250 caused
some students frustration. Thus we
propose that blanks be required as
delimiters in all Minitab commands. We
also propose the addition of at least
two new options in Minitab: a RANGE
command and a LABEL ROW command. Many
students tried to use these features. We
would like Minitab to recognize synonyms
for the STOP command. One leaves
Minitab by issuing a STOP command. If
one enters an EXIT or QUIT command, one
is told to enter STOP. From the
frequency of such requests we believe
much time can be gained by allowing
Minitab users to leave a Minitab session
via STOP, EXIT, QUIT, BYE, etc. We also
feel that a large number of command
errors could be eliminated if Minitab
accepted commands in which two
characters were transposed. Thus, we
would like to see the acceptance of DECS
and HIEGHT for DESC and HEIGHT. We
believe that Minitab should better
publicize the fact that operating system
commands can be run from within Minitab
by first specifying SYSTEM. Many of
these naive student-users tried to
obtain a directory listing of their
files or to delete an existing file in
Minitab. In addition we would like to
see Minitab provide a local command
warning option so that an individual
location could alert its users to
Minitab's inability to perform a
specific  task.  Similar  to  a  macro 
facility,  this  feature  could  be  used  to 
tell  students  that  they  could  not  run 
BASIC  within  Minitab. 
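
Two of these suggestions, synonyms for STOP and tolerance of transposed characters, can be illustrated with a short Python sketch. It is a sketch under assumptions, not a description of how Minitab parses commands: the command set is a small illustrative subset, and the names `KNOWN_COMMANDS`, `STOP_SYNONYMS` and `resolve_command` are hypothetical.

```python
# Illustrative subset of command names; not Minitab's actual command table.
KNOWN_COMMANDS = {"DESC", "HIST", "HEIGHT", "PRINT", "SET", "STOP"}
STOP_SYNONYMS = {"EXIT", "QUIT", "BYE"}


def adjacent_transpositions(word):
    """Yield every string obtained by swapping two neighbouring characters."""
    for i in range(len(word) - 1):
        yield word[:i] + word[i + 1] + word[i] + word[i + 2:]


def resolve_command(word):
    """Return the intended command name, or None if it cannot be resolved."""
    word = word.upper()
    if word in KNOWN_COMMANDS:
        return word
    if word in STOP_SYNONYMS:
        return "STOP"
    matches = {w for w in adjacent_transpositions(word) if w in KNOWN_COMMANDS}
    # Accept a transposition only when it points to exactly one command.
    return matches.pop() if len(matches) == 1 else None


print(resolve_command("DECS"), resolve_command("HIEGHT"), resolve_command("quit"))
```

Requiring a unique match is a deliberate choice in this sketch: an entry that could be a transposition of two different commands is left unresolved rather than guessed.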

We  obtained  an  embryonic  understanding 
of  the  data  analysis  process  by 
observing  the  results  from  the  students' 
CA1 runs. Here are a few of our
discoveries. Of the data entries, 71.29%
were followed by an END command. Thus
the majority of the students concluded
their data entry in the preferred way.
Almost one-third of the students PRINTed
out their data after entering it. It
appears that even in the first two weeks
of the course a surprisingly large number
of students were verifying their entered
data. On the negative side, most of the
students made inefficient use of the
Minitab SAVE command. There were 299
SAVE commands and only 305 RETRIEVE
commands! In contrast to an expert,
these naive users seem to be using
their SAVEd data files only once. It was
also discouraging to see how poor the
choice of names for the SAVE files was.
We believe that most people would have
trouble recalling the content
of their SAVEd files from the selected
names.  Finally  we  noticed  that  a  major 
error  was  designating  the  wrong  number 
of  arguments  for  a  command. 

We  also  learned  something  about  the  data 
analysis  process  by  contrasting  the  data 
analysis  flow  chart  of  an  expert  (Table 
5) with the modal data analysis flow
chart  of  the  students  (Table  6). 
According  to  the  expert  the  sequence  of 
commands  to  prepare  the  data  for  this 
experiment  would  be  to  enter  the  data  by 
employing  the  SET  and  END  commands, 
label  that  variable  by  using  the  NAME 
command,  verify  the  entered  data  by 
issuing  a  PRINT  command,  and  then  place 
the data into a Minitab file by entering
a SAVE command. The expert would
DESCRIBE the data, produce a
well-constructed HISTOGRAM, OMIT the
outlying observations, repeat the
description and displays, and finally
leave Minitab.

Table 5

CA1 Expert Data Analysis Flow Chart


Table 6

CA1 Modal Data Analysis Flow Chart


The data analysis flow
for the students was not as
straightforward although the students did use
basically  the  same  commands.  They  used 
the  SET  and  END  commands  a  number  of 
times  before  issuing  the  NAME  command. 
(The  NAME  command  is  issued  repeatedly.) 
This  was  probably  due  to  the  errors 
these  inexperienced  Minitab  users 
introduced.  They  probably  used  the 
PRINT  command  repeatedly  for  the  same 
reason.  A  number  of  HISTOGRAM  commands 
followed.  The  next  node  on  the 
students'  modal  path  is  the  DESCRIBE 
command  which  was  usually  followed  by 
another HISTOGRAM command. The only
other branch of any size from HISTOGRAM
was  to  a  Y  prompt  (probably  to  guarantee 
the  printing  of  a  long  histogram).  From 
the  Y  prompt  most  students  went  back  to 
the  DESCRIBE  command.  Thus  this  modal 
data  analysis  flow  chart  produces  a  path 
which  does  not  reach  the  STOP  command,  a 
big  difference  from  the  path  taken  by 
the  expert. 

To  determine  why  the  students  did  not 
reach  the  STOP  command,  we  constructed  a 
data  analysis  flow  chart  starting  at 
that  command.  A  portion  of  that  chart 
may  be  found  in  Table  7. 

Table 7

Path to Stop

Error -> Error -> STOP

Here  we  see  quite  clearly  what  came 
before  the  conclusion  of  a  Minitab 
session.  The  modal  response  was  an 
error. This happened 96 times. And
before this error, another error
occurred 557 times. The way most students
concluded  their  Minitab  sessions  for 
this  experiment  was  in  frustration. 

We  are  gathering  another  set  of  data 
from  the  students  who  are  taking  their 
introductory  applied  statistics  course. 
This  experiment  deals  with  a  linear 
regression modeling assignment. In
contrast to being the students' initial
computer project, it will be their last.
We hope to determine what response
pattern changes, if any, have occurred
in each student over the course of a
fifteen-week semester. Here we plan to
provide to each student a different, but
related, data set. In this way we hope
to prevent sharing of commands by a
group of students. Initially these data
were to be made available to the
students in a computer file to save the
students some time, but in order to
monitor  any  difficulties  with  data 
entry,  we  will  have  the  students  enter 
the  data  into  the  computer  themselves. 


Creating  an  Elementary  Preprocessor 

As  mentioned  above  we  are  in  the  midst 
of  creating  a  preprocessor  to  aid  our 
students  in  the  selection  of  the 
appropriate  Minitab  command  for  a 
specific analysis. This front-end will
be designed along the lines of the charts
prepared by Andrews et al. (1981) and the
Statpath software outlined by Portier
and Lai (1983). Initially we will base
our preprocessor on Version 85 of
Minitab.

Developing  Report  Generators  for 
Introductory  Applied  Statistics  Courses 

Our  report  generator  plans  include 
backends  for  the  following  topics 
usually  found  in  an  introductory  applied 
statistics  course:  confidence  intervals 
and  hypothesis  tests  for  the  population 
mean  and  for  the  difference  between  two 
population  means,  simple  linear 
regression,  and  chi-square  tests  for 
independence  and  for  equal  proportions. 
We  also  plan  to  produce  incorrect 
reports  on  these  topics  so  that  the 
student  can  be  taught  to  criticize  the 
computer-generated  output. 

Future  Directions 

We believe the work we have begun in
monitoring and evaluating user-response
patterns will produce a better
understanding of the data analysis
process, thereby enabling us to better
understand  the  students'  techniques  and 
to  develop  our  preprocessor  and  report 
generators.  In  addition  we  will  be  able 
to  suggest  enhancements  to  existing  and 
future  statistical  software.  We  also 
believe  that  there  is  much  more  that  can 
be  gained  by  extending  our  initial  work. 
Four  possible  extensions  are  more  varied 
experiments,  the  introduction  of  more 
variables  in  these  experiments,  better 
collection  devices,  and  more  complete 
error  analysis. 

Additional  experiments  to  understand  how 
users  employ  statistical  software  will 
be  based  on  analyses  other  than  those 
mentioned.  In  addition  all  these 
experiments  could  be  performed  by 
students  at  other  schools  and  by 
non-student  users  of  statistical 
software. Monitoring and evaluation of
user-response patterns could also be
performed on other types of statistical
software. In contrast to Minitab, an
interactive package with command lines,
there are batch packages and interactive
packages with prompt command lines,
cursor menus, and mouse menus. User
demographic variables should also be
brought into these experiments along
with time variables for experiments
dealing with interactive systems.

Some  ideas  for  better  collection  devices 
include  devices  which  capture  all  the 
user  entries  including  errors  which  the 
user  corrects  before  sending  them  to  the 
package.  Completely  automated  devices, 
possibly  part  of  the  software,  are 
another possibility. Finally, devices
which enable one to examine random
samples of the user population are
a desired extension of our work.

In addition to dealing with the
forgiving errors of a syntactical or
semantic nature, a complete evaluation of
user-response patterns should analyze
typographical errors, fatal errors, and
logical errors. Note that these errors,
especially logical ones, will be
difficult to examine.


ABSTRACT

Traditional  databases  accommodate  statistical  applications  -  and  other  applications  -  at  an  abstraction  level 
higher  than  the  user  level.  But  statistical  analysis  possesses  exploitable  properties  that  can  be  used  to  integrate 
the realization of statistical functions with the database activity of data acquisition. If statistical functions are
parameterized,  it  is  easy  to  see  that  they  share  many  common  parameters.  These  parameters  are  both  updatable 
and  additive.  Statistical  functions  and  their  parameters  are  explored  together  with  round-off  errors  resulting 
from  updates. 


1.  Introduction 

Attention has been drawn recently to the inadequacy of
current databases in accommodating statistical analysis [2, 12,
24, 31]. The inadequacy arises from the intrinsic structure of
statistical analysis and the inability of the underlying models
of database systems to capture and correctly model statistical
structure efficiently. The proposed model is concerned with
exploiting three properties of statistical analysis that result in
inefficiencies when realized in traditional databases. They are:
(1) the nature of statistical queries, (2) the nature of statistical
calculations, and (3) statistical classifications.


1.1. Statistical Queries

Turner et al. [35] have distinguished three types of queries:
informational, operational and statistical queries. Statistical
queries request over 10% of the records in a database while
the other types request less. One consequence of this is that
response times are high when statistical queries are made.
Another important fact is that databases, in both research and
practice, have been geared primarily towards informational
and, to a lesser extent, operational queries, as seen in the concept
of primary keys (informational) and secondary keys (operational)
and the subsequent theories underlying the various normal
forms [3, 10, 11]. Statistical queries have been subordinated.
As a result, man-made attributes such as social security
number, employee number, etc., have become the focal point
for retrieval in databases. As Turner et al. have pointed out,
statistical queries are based on more natural attributes like 'all
females', 'all black employees', 'minority engineers', etc. The
consequence of this is that a statistical user is faced with the
task of specifying statistical queries in terms of informational
and operational queries. Very complex formulations often
result [17]. So, we can conclude that the statistical interface in
many commercial systems is unfriendly.


1.2. Statistical Calculations

A major source of redundancies in statistical applications is
in the function calculations. If statistical functions are
parameterized and viewed in terms of these parameters, it will
be seen that many of them share common parameters. For
example, to calculate Pearson's correlation coefficient
between two variables (attributes) A and B, we need the
following parameters: the sample size, n; the sum of products of A
and B; the sum of squares for A and the sum of squares for B.
On the other hand, to calculate the standard deviation for A
(or B), the parameters are: n and the sum of squares for A (or
B). So, if the parameters for the correlation coefficient are
available, no extra retrieval is necessary to compute the
standard deviation. The calculations for a multiple regression
model begin with a matrix of Pearson's correlation coefficients
(whose parameters have been mentioned above). The same
matrix is the starting point for factor analysis and discriminant
analysis. Thus, thinking of statistical functions in terms of
their parameters can save redundant calculations. The
parameters of many statistical functions, ranging from simple
descriptive  statistics  to  multivariate  analyses,  will  be  explored. 
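
The sharing of parameters between the correlation coefficient and the standard deviation can be made concrete with a small Python sketch. It assumes that the sums of squares and products are taken about the means; the function names and the sample values are hypothetical and serve only as illustration.

```python
import math


def gather_parameters(a, b):
    """Retrieve the records once and return the shared parameters.

    Assumption: sums of squares and products are taken about the means.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return {
        "n": n,
        "ss_a": sum((x - mean_a) ** 2 for x in a),
        "ss_b": sum((y - mean_b) ** 2 for y in b),
        "sp_ab": sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)),
    }


def correlation(p):
    """Pearson's correlation coefficient from the stored parameters."""
    return p["sp_ab"] / math.sqrt(p["ss_a"] * p["ss_b"])


def standard_deviation(p, attribute):
    """Standard deviation of attribute 'a' or 'b'; no further retrieval."""
    return math.sqrt(p["ss_" + attribute] / (p["n"] - 1))


# Invented values for two attributes A and B.
params = gather_parameters([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 9.0])
r = correlation(params)
sd_a = standard_deviation(params, "a")
```

Once `params` is stored, both functions are realized from it without touching the records again.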

An interesting property of these statistical parameters is that
they are 'updatable' in the sense that if $P_n$ is the parameter for
$n$ points, it is possible to compute $P_{n+1}$ given $n$, $P_n$ and $a_{n+1}$,
where $a_{n+1}$ is the new datum. The updating formulas for some
parameters have been known for some time, starting with
Welford's pioneering paper [37], and more recently also [6,
19, 22]. This means that if all required parameters
are known a priori, they can be kept current during data
acquisition, i.e., during insertions, deletions and modifications.
Updating some parameters introduces additional round-off
errors, however. The nature of these errors for some
parameters has been investigated [6, 21, 27, 38]. This paper extends
these investigations and shows how they affect the final
function values. It should be mentioned that batch updates are also
possible using the additive formulas or transaction type
updates [31].


1.3.  Statistical  Classifications 

Classification is an inherent part of statistical analysis. Many
classification schemes take the form of 'treatments' or
'categories' by which some metric is grouped. For example,
suppose there is a 'category' attribute, RACE, whose domain
is {hispanic(h), white(w), black(b), other(o)}, by which
different users choose to classify a 'data' attribute, SALARY.
Note that a category is a member of the power set of the
domain of a category attribute. Let us suppose that we have
the following categories: wbh, bh, and o. We shall associate
with each category a set of parameters, calculated on the data
attribute, that is necessary to realize some statistical
function(s) for the category. So if a user (or users) is
interested in the mean SALARY, the associated parameters
are count and sum. If another (or the same) user is
now interested in the mean SALARY of the category bho, no
additional parameter gathering is necessary since this can be
derived by combining the parameters for bh and o. Almost all
statistical parameters are 'additive'. The additive operation of
sum or count is an arithmetic addition. The additive operation
of many other statistical parameters involves many arithmetic
operations.

The general rules for deriving new categories are by set union
if the categories are disjoint (addition of parameters) and by
set difference if the two categories have a super/sub-set
relationship (subtraction of parameters). Thus, in the above
example, the additional categories that can be derived are wbho,
wo and w. Again, it is obvious that redundancies can exist if
these  deriving  rules  are  not  applied. 
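
The derivation rules can be sketched in Python for the mean SALARY example. The stored counts and sums below are invented, and the class name `MeanParams` is hypothetical; the sketch only shows that union maps to addition of parameters and set difference to subtraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeanParams:
    """Parameters kept per category to realize the mean: count and sum."""
    count: int
    total: float

    def __add__(self, other):
        # Set union of disjoint categories: add the parameters.
        return MeanParams(self.count + other.count, self.total + other.total)

    def __sub__(self, other):
        # Set difference with a contained category: subtract the parameters.
        return MeanParams(self.count - other.count, self.total - other.total)

    @property
    def mean(self):
        return self.total / self.count


# Hypothetical stored parameters for SALARY (in thousands) by RACE category.
stored = {
    "wbh": MeanParams(6, 300.0),
    "bh": MeanParams(4, 180.0),
    "o": MeanParams(2, 100.0),
}

bho = stored["bh"] + stored["o"]     # union of disjoint categories
w = stored["wbh"] - stored["bh"]     # bh is contained in wbh
wo = w + stored["o"]
wbho = stored["wbh"] + stored["o"]
```

None of the four derived categories requires a new pass over the SALARY values; each is obtained from the three stored parameter sets.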


2.  Related  Work 

Many attempts have been made to accommodate some of the
problems that have been stated. The two most common
approaches are either to build specialized statistical databases
or to integrate statistical analysis tools with commercial
databases. Some specialized systems include: the use of inverted
file structures, as in TDMS [4]; the use of 'transposed' files
such as in RAPID [33]; and the use of special data structures
as in SUBJECT [7]. Integrated systems take the form of
providing better interfaces between the two systems while
providing a rich operational repertoire for each subsystem. Such
systems include REGIS [18], RIGEL [28], etc. There are
more examples [31]. The first approach lacks generality. While
our approach here is of the second type, it differs from others
in that it models the statistical subsystem at all three levels of
classical database design, with interfaces between the two
subsystems at the two top levels. The preponderance of
commercial systems tends to suggest that integration is the preferred
means of reaching many statistical users.

As  mentioned  earlier,  the  updatability  and  additivity  of  many 
statistical  parameters  have  been  known.  But  these  have  been 
mostly  limited  to  means  and  standard  deviation  calculations. 
In  SYSTEM  R  [1],  the  'trigger'  concept  is  updatability  applied 
to  simple  aggregate  functions.  Triggers  lack  globality.  Koenig 
and  Paige  in  MADAM  [19],  allow  one  to  define  functions  in 
terms  of  simpler  functions  (parameters)  -  mean  defined  in 
terms  of  sum,  for  example.  No  global  sharing  is  apparent. 

Sato [30] has given a system of classification and rules of
derivation. However, the categories are pairwise disjoint and
the database is static. We are considering a more general
classification scheme in a dynamic database.

Many investigations [16, 21, 27, 38] about the size of round-off
errors have been carried out based on Wilkinson's work
[39]. Most of them have been empirical, however. This
is probably due to the tedious nature of the
often desired 'forward' analysis involved in these algebraic
processes. But Chan and Lewis [5] have developed theoretical
upper bounds for mean and standard deviation calculations.

This  paper  will  first  concentrate  on  the  statistical  aspect  of 
this  model  and  later,  the  integrated  system. 


3. Statistical Parameters

The updatability of some statistical parameters has been
investigated [16, 21, 22, 37, 38, 40]. The treatment, however, has
been to regard these parameters as final function values in
their own right rather than as parameters to many statistical
functions. The additivity of these parameters also warrants a
separate treatment: updatability (a special case of additivity)
is employed during data acquisition to keep parameters
current, while additivity is used for the derivation of new
categories and/or merging of batch updates into a main
database. The updating and additive formulas are derived by
simple algebraic manipulations. These formulas are now
presented and many statistical functions are parameterized.

In what follows, the beginning letters of the alphabet, A, B,
C, ... are single attributes whose values in a table instance of
$n-1$ records are $\{a_1, a_2, \ldots, a_{n-1}\}$, $\{b_1, b_2, \ldots, b_{n-1}\}$, ... respectively,
where $\{\ldots\}$ is used to denote a multiset. W, X, Y, Z will be
used to denote sets of attributes. For instance, $X = A_1 A_2 \cdots A_d$
and $d = |X|$ is the dimension for the attribute set X. The
records here will be drawn from the cross product of the
domains of the attributes in X and corresponding small letters
are used to denote the records.

3.1.  Updatability 

Updatability refers to how to calculate $P(X)_n$ from $P(X)_{n-1}$
and the new datum, $x_n$. The updating formulas for many
parameters (counts; sums; sum squares, sum cubed, ...;
product sums (and powers); etc.) are of the form

$$P(X)_n = P(X)_{n-1} + f(x_n), \qquad P(X)_0 = 0,$$

where $f(x_n)$ is the initial calculation for the term to be
added, depending on the parameter. For instance, for count
and sum, $f(x_n)$ is the identity function; for sum cubed,
$f(x_n) = a_1^3 a_2^3 \cdots a_d^3$; etc.

However, the formulas for other parameters require more
than one addition operation. For the mean (which can be
parameterized), Welford [37] gave the following:

$$P(X)_n = \frac{n-1}{n} P(X)_{n-1} + \frac{1}{n} f(x_n), \qquad P(X)_0 = 0.$$

Similarly, the sum of squares is

$$P(A)_n = P(A)_{n-1} + \frac{n-1}{n} \left( a_n - M(A)_{n-1} \right)^2, \qquad P(A)_0 = 0,$$

and the sum of products is

$$P(X)_n = P(X)_{n-1} + \frac{n-1}{n} \left( a_n - M(A)_{n-1} \right) \left( b_n - M(B)_{n-1} \right), \qquad P(X)_0 = 0,$$

where $M$ is the parameter mean. Generally, for the sum of
squares and the sum of products, $d = 1$ and $d = 2$ respectively.
To further reduce the roundoff errors from the increased
number of operations, Reckan [38] proposed for the mean and
sum of squares, the following:

$$P(X)_n = P(X)_{n-1} + \left( f(x_n) - P(X)_{n-1} \right) / n$$

and

$$P(A)_n = P(A)_{n-1} + \left( a_n - M(A)_{n-1} \right)^2 - \left( a_n - M(A)_{n-1} \right)^2 / n.$$

An additional advantage is that the sum and sum squares can
be  derived  more  accurately. 
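
The updating formulas can be exercised with a short Python sketch that keeps the mean, the sum of squares and the sum of products current as each record arrives. The class name `RunningParameters` and the sample records are hypothetical, and the sketch assumes two attributes A and B (d = 2).

```python
class RunningParameters:
    """Mean, sum of squares and sum of products kept current per record."""

    def __init__(self):
        self.n = 0
        self.mean_a = 0.0
        self.mean_b = 0.0
        self.ss_a = 0.0   # sum of squared deviations of A
        self.sp_ab = 0.0  # sum of products of deviations of A and B

    def update(self, a, b):
        """Fold one new record (a, b) into the stored parameters."""
        self.n += 1
        dev_a = a - self.mean_a  # deviation from the previous mean of A
        dev_b = b - self.mean_b  # deviation from the previous mean of B
        weight = (self.n - 1) / self.n
        self.ss_a += weight * dev_a * dev_a
        self.sp_ab += weight * dev_a * dev_b
        # Mean update in the form P_n = P_{n-1} + (f(x_n) - P_{n-1}) / n.
        self.mean_a += dev_a / self.n
        self.mean_b += dev_b / self.n


# Invented records, inserted one at a time as during data acquisition.
running = RunningParameters()
for a, b in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 9.0)]:
    running.update(a, b)
```

After the four insertions the stored values agree with those of a two-pass calculation over the same records, without the records having been retrieved again.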

Downdating formulas can also be derived by algebraic
manipulation of the above formulas; this corresponds to the deletion
of a record from a database. West has shown that Reckan's
formula gives a better accuracy than other alternative
methods. However, in the context of a database, separate
parameters can be kept for deleted records using only the
updating formulas. Then at the time of actual function
computation, the parameters can be 'added' together.


3.2. Additivity

Additivity is used for calculating the parameters of derivable categories.

Additivity is explained in the following way. Let $C(X)_i$ be a
multiset of size $n_i$ and $P(X)_i$ the associated parameter. Then
additivity is: a) given $P(X)_i$, $i = 1, \ldots, k$, compute $P(X)^*$ for the
multiset $C(X)^*$ of size $n$, $n = n_1 + n_2 + \cdots + n_k$. That is, given
the parameters of $k$ multisets, find the parameter of the
merged set; or b) given $C(X)_s$ and $C(X)_i$, $i = 1, \ldots, k$, such that
$C(X)_s \supseteq \bigcup_{i=1}^{k} C(X)_i$, compute $P(X)^-$ for the multiset $C(X)^-$ of
size $m$, where $m = n_s - \sum_{i=1}^{k} n_i$ and

$$C(X)^- = C(X)_s - C(X)_1 - C(X)_2 - \cdots - C(X)_k.$$

That is, given the
parameters of a set of multisets and the parameter of a
multiset containing them, calculate the parameter of the resulting
multiset after removing all the smaller ones from the big
multiset. The additive formulas for the statistical parameters are now
given without proofs. The proofs are given elsewhere [24].

For the parameters count; sum; sum square, sum cubed, ...;
product sums; etc., the additive formulas are of the form

$$P(X)^* = \sum_{i=1}^{k} P(X)_i \qquad \text{and} \qquad P(X)^- = P(X)_s - \sum_{i=1}^{k} P(X)_i.$$

For mean, we have that

$$P(X)^* = \frac{1}{n} \sum_{i=1}^{k} P(X)_i \, n_i$$

and

$$P(X)^- = \frac{1}{m} \left[ P(X)_s \, n_s - \sum_{i=1}^{k} P(X)_i \, n_i \right].$$

The additive formulas for sum of squares are

$$P(A)^* = \sum_{i=1}^{k} \left( P(A)_i + n_i M(A)_i^2 \right) - n \left( M(A)^* \right)^2$$

and

$$P(A)^- = P(A)_s + n_s M(A)_s^2 - \sum_{i=1}^{k} \left( P(A)_i + n_i M(A)_i^2 \right) - m \left( M(A)^- \right)^2,$$

where $M$ is the mean.

For the sum of products, we have

$$P(X)^* = \sum_{i=1}^{k} \left( P(X)_i + n_i M(A)_i M(B)_i \right) - n M(A)^* M(B)^*$$

and

$$P(X)^- = P(X)_s + n_s M(A)_s M(B)_s - \sum_{i=1}^{k} \left( P(X)_i + n_i M(A)_i M(B)_i \right) - m M(A)^- M(B)^-.$$

The formula has been given here for $X = AB$, i.e., $d = 2$. It can
easily  be  extended  for  d>2. 
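The additive formulas for the count, mean and sum of squares can
be sketched as below. The function names merge, remove and direct
are hypothetical, and each multiset is represented by an assumed
triple (n, mean, ss), with ss the sum of squares about the mean.

```python
def merge(parts):
    """Combine (n, mean, ss) triples of k multisets into one triple."""
    n = sum(p[0] for p in parts)
    mean = sum(p[0] * p[1] for p in parts) / n
    ss = sum(p[2] + p[0] * p[1] ** 2 for p in parts) - n * mean ** 2
    return n, mean, ss


def remove(whole, parts):
    """Parameters of `whole` after taking out the multisets in `parts`."""
    n_s, mean_s, ss_s = whole
    m = n_s - sum(p[0] for p in parts)
    mean = (n_s * mean_s - sum(p[0] * p[1] for p in parts)) / m
    ss = (ss_s + n_s * mean_s ** 2
          - sum(p[2] + p[0] * p[1] ** 2 for p in parts)
          - m * mean ** 2)
    return m, mean, ss


def direct(values):
    """Two-pass reference computation, used here only for comparison."""
    n = len(values)
    mean = sum(values) / n
    return n, mean, sum((v - mean) ** 2 for v in values)


a, b = [1.0, 2.0, 3.0], [10.0, 20.0]
merged = merge([direct(a), direct(b)])            # case a): merged multiset
remaining = remove(direct(a + b), [direct(b)])    # case b): b removed
```

In this sketch, merged agrees with direct(a + b) and remaining
agrees with direct(a) up to rounding, without rescanning the
underlying values.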

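The condition number, $\kappa$, of a multiset is a measure of how
well conditioned the numbers are. $\kappa$, as defined by Chan and
Lewis [5], is given by

$$\kappa = \frac{\| f(C(X)) \|}{(n-1)^{1/2} D},$$

where $D$ is the standard deviation and $\| f(C(X)) \|$ is the
Euclidean norm. It is a parameter used to keep track of how
good a computation is given the structure (relative magnitude)
of the numbers. $\kappa$ is updatable. The additive formulas, for the
merged multiset of size $n$ and the reduced multiset of size $m$,
are

$$\kappa^* = \left( \frac{S(A)^* + n \left( M(A)^* \right)^2}{S(A)^*} \right)^{1/2}$$

and

$$\kappa^- = \left( \frac{S(A)^- + m \left( M(A)^- \right)^2}{S(A)^-} \right)^{1/2},$$

where $S$ and $M$ are the sum of squares and mean respectively.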
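As a small check on the definition, the condition number can be
computed either from the raw values or from the stored parameters
alone. The sketch below assumes the parameters are the count, the
mean and the sum of squares about the mean; the function names
are hypothetical.

```python
import math


def kappa_from_parameters(n, mean, ss):
    """Condition number from the count, mean and sum of squares about the mean."""
    return math.sqrt((ss + n * mean * mean) / ss)


def kappa_from_values(values):
    """Condition number from the definition: norm / ((n - 1)^(1/2) * D)."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((v - mean) ** 2 for v in values)
    norm = math.sqrt(sum(v * v for v in values))
    d = math.sqrt(ss / (n - 1))            # standard deviation
    return norm / (math.sqrt(n - 1) * d)


values = [1.0, 2.0, 3.0]
shifted = [v + 1000.0 for v in values]     # same spread, much larger mean

k_small = kappa_from_parameters(3, 2.0, 2.0)
k_large = kappa_from_parameters(3, 1002.0, 2.0)
```

Shifting the data away from zero leaves the spread unchanged but
inflates the condition number, which is the situation in which
cancellation errors become a concern.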

We have now seen both the updating and additive formulas of
almost all the statistical parameters that will be needed.
Updating is used to keep parameters current, while additivity
is used for calculating the parameters of derivable categories.


3.3. Statistical functions and their parameters

Some representative statistical functions and their parameters
are discussed. These functions can be found in most
introductory statistical textbooks and are offered in many
statistical packages [23, 29].

Basic screening functions consist of frequency count, sum,
mean, variance and standard deviation and, to a lesser
extent, skewness and kurtosis. The parameters are counts,
sum (or mean) and sum of squares for most of them. For
skewness and kurtosis, the parameters are count, sum, sum
square, sum cube and sum fourth. A frequently used measure of
correlation between two attributes is Pearson's correlation
coefficient, and its parameters are counts, sums of
squares and sum of products.
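To illustrate how a function value is obtained from parameters
alone, the sketch below computes Pearson's correlation
coefficient. It assumes the stored parameters are raw sums (count,
sums, sums of squared values and sum of products); the function
name pearson is hypothetical.

```python
import math


def pearson(n, sum_a, sum_b, sum_aa, sum_bb, sum_ab):
    """Pearson's r from raw power sums; no access to the records is needed."""
    cov = n * sum_ab - sum_a * sum_b
    var_a = n * sum_aa - sum_a ** 2
    var_b = n * sum_bb - sum_b ** 2
    return cov / math.sqrt(var_a * var_b)


a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 3.0, 2.0, 4.0]

# In a database these six parameters would be kept current by updating.
r = pearson(len(a), sum(a), sum(b),
            sum(x * x for x in a), sum(y * y for y in b),
            sum(x * y for x, y in zip(a, b)))
```

The raw-sum form is the simplest to write down but is the one
most exposed to cancellation; parameters kept about the mean, as
in the updating formulas, avoid this at the cost of slightly more
work per record.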

Contingency tables are two-dimensional tables on the
attributes A and B with counts in each cell. Some functions
associated with a contingency table are $\chi^2$, $\phi$, Cramer's V,
the contingency coefficient C, tau B, tau C and Spearman's
correlation coefficient. All these functions can be computed
from the cell counts of the table; thus, the parameters are
counts.

A class of statistical functions is used for parametric and
non-parametric hypothesis testing. In the former, some
information is known about the population. In hypothesizing
about a single mean where the standard deviation, $\sigma$, is
known, the standard normal statistic, $Z$, is used, where

$$Z = \frac{M(A)_n - \mu_0}{\sigma / n^{1/2}}$$

and $\mu_0$ is the mean being tested for. If the population is
normal, the T statistic is used and the formula is similar to
that of Z except that the sample standard deviation is
substituted for $\sigma$. The parameters for both Z and T are counts,
sums and sums of squares. The parameters are the same when
hypothesizing about two means and one or two variances (F
statistic). The sign test is a form of non-parametric hypothesis
testing that uses a Z statistic computed from counts. For run
tests, on the other hand, counts can only be kept incrementally
if the input is serialized.

Experimental classifications are concerned with the means
and variances of $k$ populations, the entire population having
been subjected to at least one treatment. One-way
classification is concerned with the means and variances of $k$
populations resulting from one treatment. In a database
context, a treatment might be the category attribute RACE, and
the equality (or inequality) of the means of a data (or
summary) attribute, say SALARY, the point of interest. If the
populations are normal and have the same variance, the model
equation is given by

$$SST = SSE + SS(Tr),$$

where $SST$ is the total sum of squares, $SSE$ is the error sum of
squares and $SS(Tr)$ is the treatment sum of squares. Expressing
this model within the parametric framework gives

$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( a_{ij} - M(A)^* \right)^2 = \sum_{i=1}^{k} \left( \sum_{j=1}^{n_i} \left( a_{ij} - M(A)_i \right)^2 \right) + \sum_{i=1}^{k} n_i \left( M(A)_i - M(A)^* \right)^2.$$

Thus, the parameters are the sum of squares, sums and counts.
Analogous formulas are derivable for the general n-way
classification problem with the same parameters.
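The one-way decomposition can be computed from per-category
parameters without touching the records, as in the following
sketch. The function names and the group labels are hypothetical,
and the per-group parameters are assumed to be the count, the sum
and the sum of squared values.

```python
def parameters(values):
    """Per-category parameters: (count, sum, sum of squared values)."""
    return len(values), sum(values), sum(v * v for v in values)


def one_way(groups):
    """Return (SST, SSE, SS(Tr)) from a mapping category -> parameters."""
    n = sum(g[0] for g in groups.values())
    total = sum(g[1] for g in groups.values())
    total_sq = sum(g[2] for g in groups.values())
    sst = total_sq - total ** 2 / n                      # total sum of squares
    sse = sum(sq - s ** 2 / c for c, s, sq in groups.values())
    return sst, sse, sst - sse                           # SS(Tr) = SST - SSE


groups = {
    "group_a": parameters([1.0, 2.0, 3.0]),
    "group_b": parameters([4.0, 5.0, 6.0]),
}
sst, sse, ss_tr = one_way(groups)
```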


Among the class of statistical functions in multivariate
analysis, perhaps the most common are multiple linear
regression, factor analysis and discriminant analysis. For this
class of problems, the beginning point of computation is the
square matrix

$$[P_{ij}], \quad 1 \le i, j \le n,$$

where $n$ is the number of variables (i.e., attributes) and
$P_{ij}$ is Pearson's correlation coefficient for each pair of
variables. For linear regression, the variables include both the
dependent and independent variables. Discriminant analysis may
require more than one such matrix. Thus, the parameters for
this class of problems are those for Pearson's correlation
coefficients: sums, sums of squares and counts.

Many statistical functions and their parameters have been
discussed. But there are some statistical functions that do not
easily lend themselves to this parameterization. They include
minimum, maximum (and so, range) and median. The reason is that
each of these functions is an attribute value of a record in
the database, so if this particular record is modified (or
deleted), the database has to be scanned for a new function
value. However, for such a function there may be many records
with its value, so a count of such records can be kept together
with the function value. Thus, no scanning is necessary until
the count becomes zero.
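The count-with-value idea for the maximum can be sketched as
below. The class CountedMax is hypothetical, and a plain list
stands in for the database; the scans attribute is only there to
show when a rescan is triggered.

```python
class CountedMax:
    """Keep the maximum together with the number of records holding it."""

    def __init__(self):
        self.value = None
        self.count = 0
        self.scans = 0          # number of database scans triggered so far

    def insert(self, v):
        if self.value is None or v > self.value:
            self.value, self.count = v, 1
        elif v == self.value:
            self.count += 1

    def delete(self, v, remaining):
        """Call after v has been removed; `remaining` is the current database."""
        if v != self.value:
            return
        self.count -= 1
        if self.count == 0:     # only now is a scan unavoidable
            self.scans += 1
            self.value = max(remaining) if remaining else None
            self.count = remaining.count(self.value) if remaining else 0


records = [5, 9, 9, 3]
tracker = CountedMax()
for r in records:
    tracker.insert(r)

for victim in (9, 9):
    records.remove(victim)
    tracker.delete(victim, records)
```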


4. Error Analysis


4.1. Overview

Use of the various updating formulas in computation will
mean increased floating-point operations. Floating-point
operations result in errors due to round-offs and catastrophic
cancellations. The size of the error also depends on the
computer (word size, guard bits, etc.). Thus it is important to
understand the nature of these errors while using these
updating formulas. A model for studying the size of these
floating-point errors has been developed [39]. Most analyses
have been empirical, however. The reason is probably the
tedious nature of the often desired 'forward' analysis in these
statistical processes. In forward analysis, we desire to know
the perturbation introduced in a function $F$ as a result of the
perturbations in a multiset $C(X)$; i.e., given $\Delta C(X)$, what
is $\Delta F$? In particular, the relative error of $F$ ($\Delta F / F$)
is of interest and is generally expressed as $\kappa \Delta C(X)$,
where $\kappa$ is the condition number of the multiset.
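Catastrophic cancellation is easy to provoke with badly
conditioned data. The sketch below compares the textbook
one-pass sum of squares with an updating formula on four values
with a large common offset; it runs in ordinary double precision
and the function names are hypothetical.

```python
def textbook_ss(values):
    """Sum of squares about the mean as sum(x^2) - (sum x)^2 / n."""
    n = len(values)
    return sum(v * v for v in values) - sum(values) ** 2 / n


def updating_ss(values):
    """Sum of squares about the mean, accumulated one record at a time."""
    n, mean, ss = 0, 0.0, 0.0
    for v in values:
        n += 1
        d = v - mean
        mean += d / n
        ss += d * d * (n - 1) / n
    return ss


# Large mean, small spread: a badly conditioned multiset.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
exact = 90.0    # deviations from the mean are -6, -3, 3, 6

err_textbook = abs(textbook_ss(data) - exact)
err_updating = abs(updating_ss(data) - exact)
```

The textbook form subtracts two numbers near 4e18 whose spacing
in double precision is far larger than the answer, so its result
cannot be close to 90; the updating form works with small
deviations throughout.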

Chan and Lewis [5, 6] have a framework for developing
theoretical upper bounds for some of these complex statistical
formulas. From a set of axioms, upper bounds were developed
for means and variances. Some of these bounds are expressed
in terms of $\kappa$, the condition number (defined above).
$\kappa \ge 1$ always and, for large $n$, $\kappa$ is approximately $M/S$
(mean over standard deviation), the reciprocal of the
coefficient of variation. In general, $\kappa \approx 1$ implies the
numbers are well conditioned while numbers with $\kappa \gg 1$ are
badly  conditioned.  It  is  shown  that  errors  for  means  and  vari¬ 
ances  are  proportional  to  ri  or  ri'f*. 

This approach is used to find the errors for all statistical
parameters and functions (excluding those that involve only
counts (or sizes), since the computation of those parameters
involves only integer arithmetic). It will be assumed that the
parameters are accumulated in twice the precision of the final
function values to mitigate the errors. Absolute errors are
computed for the parameters and relative errors for the
function values. In what follows, $\epsilon$ and $p$ are the machine
epsilons (relative precisions) in single and double precision
respectively, on a given computer.


4.2. Errors in Parameters

For sum, $T_n$,

|Ar.|<ri’^|j4||p+0(/.’). 

General sum of $k$-th power parameters of the form

$$T_n(k) = \sum_{i=1}^{n} a_i^k$$

have  errors  given  by 

|Ar.(k)|<(rt  +ri‘/’)|M||*/,+0(r.’). 

This formula is easily generalizable to higher order product
sums.

For the mean, many possible updating formulas exist. We give
the result for applying Reckan's formula, which is amongst
the least:

IAr.l<((2/3),i'^-r4)ll^llpW(p^). 

Various formulas exist for the computation of the sum of
squares. The bound for Reckan's is given. The error is

I  AT,  1  <pT.  +(3.4 1 4+«  +« ‘  A)  1 1  .T  I  j  V+0(p’) 
and  for  the  sum  of  products, 

|Ar.|<(«+3)r.^+o(z.’). 

As pointed out by Chan and Lewis, the choice of the mean
computation method does not affect the error in the sum
of products.


4.3. Statistical Functions

With the parameters in double precision ($2t$ digits), these
functions will be calculated in double precision and then
rounded to $t$ digits. If $F_n$ is a final function value, we are
interested in the relative error, $|\Delta F_n| / |F_n|$. Recall that
$\epsilon$ is the single precision epsilon.

For  sum,  the  relative  error  is, 

-^^^<nr>+e+0(p’). 

*  n 

For the mean (Reckan's), the relative error is


-*‘^/**<(.667ti  +2)p+e+0(/>’). 

r„ 

Many statistical functions having the sum of squares as
parameter share similar bounds; these include variance and
standard deviation. For the variance,

<(>i  +4)p+c+(.94(t  +  l4.5ii‘^)icp+0(p’), 

•  n 

using Welford's. This error is proportional to $n$. As stated by
Chan and Lewis, the error is independent of $\kappa$ if the standard
two-pass method of computing the sum of squares is used (but
this method is not updatable) until $\kappa$ grows larger than
$1/p^{1/2}$, when the term in $O(\kappa^2 p^2)$ becomes significant. The
error is proportional to $\kappa^2$ if some other methods are used
[5, 24].

For parametric hypothesis testing, the relative errors for both
the Z and T statistics are


-^^=^<3/.+€+(2ii  ^+4K(«  - 1  )^S 

/(A/(^).-,io))<cp+0(p’) 

and 

3p+« +((«  +4)p+(.94«  +4.4 1  g;i  ‘/’)«cp)* 

*  ■ 

+(.667n*+4K(n-lrtS/(A/(/l  ).-Ho))kp*0(p\ 

where $M$ and $S$ are the mean and standard deviation
respectively. The errors for all other forms of hypothesis
testing (two means, paired sample test and variances) are
proportional to $\kappa$ and are $O(n^{3/2})$ in the worst case [24].

Using  Welford's  method  for  the  sum  of  squares  in  computing 
Pearson's  correlation  coefficient,  the  error  is 

^  r  *  ^  +4)p+«+(.667h‘-“+4«  )((ii  - 1  )\(S(A  ).K(/t ) 

•  « 

/( |fl.-A/{/(  ),\)HS(B ). I )))p 

*0(p^), 

for some constants $a_i$ and $b_i$. Again, the error is proportional
to $\kappa$ and $n$.

For experimental classification, the errors for one-way
classification are

mn.t/AT- )  +0(p’).

and

SS(rr)

<(k  +4)p+0(p’)

<(fi  +k  +5)p+ma.x|Ar,  1  +0(p’).

where $T_i$ is the sum of squares for the $i$th treatment group,
and $k$ is the number of groups. For two-way classification,
only SST and SSE are proportional to $k$. For higher order
classifications, error bounds are similar. For Latin square and
factorial designs the errors are proportional to $n^2$.

Errors in multivariate analyses are more complicated to
analyze. But the errors are related, since the computations
start with a Pearson correlation matrix, $P$. For linear
regression, the problem is thus $PX = B$. $P$ is symmetric (and
positive definite?). It has been shown [34] that

$$\frac{\| \Delta X \|}{\| X \|} < \operatorname{cond}(P) \, \frac{\| \Delta P \|}{\| P \|}$$

if $P$ is perturbed. $\| X \|$ is the norm (i.e., a measure of
magnitude) of the matrix $X$. Since $\Delta P$ and $P$ are symmetric,
the relative error can be expressed through three eigenvalues:
the minimum eigenvalue of $P$, and the maximum and minimum
eigenvalues of $\Delta P$. However, it is desirable to express the
error bound in terms of $n$ and $\kappa$. The errors for factor and
discriminant analyses are similarly bounded.


4.4. Error Summary

The relative errors are proportional to $n$ in most cases and
$n^2$ in a few. This error is also dependent on $\kappa$, the condition
number. Some runs similar to the ones performed by Chan
and Lewis were performed on a DEC 1099 for validation. The
results were generally good, since most of the errors occur
while performing double precision arithmetic; for instance,
for $n = 10000$, in the worst case, at least five digits of
accuracy were maintained in variance calculations on average
(the DEC 1099 can hold 8.429 digits).

With the theoretical upper bounds, it is possible to monitor the
accuracy (the number of correct leading digits) of each
parameter. Unacceptable accuracy due to a large $n$ can be
rectified by recalculating the particular parameters from the
underlying database but, this time, breaking up the parameters
into, say, $p$ smaller parts. These $p$ parts can then be added
(using the additive formulas) before final function values are
calculated. Too many insertions and deletions may also
warrant occasional recalculation of parameters.

Since errors resulting from down-dating (removal of a datum)
are larger, it may be advisable to calculate the parameters for
removed data separately. Before function calculation, these
parameters are then subtracted to get the actual parameter
values (recall the $P^-$ formulas).


5. The Integrated Model


5.1. Overview

A schematic representation of the model is shown in Fig. 1.

Fig. 1. The Model


5.2. User Level

Users are allowed to specify functions of interest. These
functions are drawn from a catalog provided by the DBMS. Their
parameters and their updatable and additive characteristics
are known to the system. These statistical functions operate on
data attributes, SALARY for example, that are classified by
some other category attribute(s) like RACE, SEX, AGE, etc.
Thus, users only need to be concerned with function and
classification specifications. A distinction is made between a
category attribute and a data (or summary) attribute. Turner
et al. have described category attributes as those with small
domain sizes and so with great ability to identify. By Stevens'
typology [33], these correspond to those scales that are at most
ordinal (qualitative). We use the term here to include all
those attributes that are inherently qualitative or have been
coded to become so, by grouping, for example. Data attributes,
on the other hand, have no restrictions, although in most
cases they are at least ordinal (quantitative).

The DDL, the data definition language, should include
constructs that allow users to make these SPECifications.
Rightly, the indirectly SPECified data objects are the
categories and their associated parameters. Similarly, the DML,
the data manipulation language, provides users the ability to
REQuest the values of previously specified functions, as well as
the ability to CANcel any previous specifications. For example:

SPEC mean, stddev: SALARY by SEX
SPEC cntgncy: SALARY by SEX and RACE
REQ mean: SALARY by SEX
CAN cntgncy: SALARY by SEX and RACE

The first SPEC states that a user wants the mean and standard
deviation of the SALARY for each of the categories, male
and female. The second SPEC is for a 2-dimensional
contingency table for SALARY. This approach of
'SPECify-before-use' is justified in the sense that statisticians
in an enterprise generally know what analyses are of interest,
particularly after an 'explorative' phase has been undertaken
[2]. Market research enterprises usually have well defined,
stable analytic sets that only change periodically. Also, in the
case of experimental situations, the factors (category
attributes) of interest, the dependent variables (data
attributes) as well as the type of analyses are well defined.
REQ is an actual REQuest for the calculation of the mean
salaries for males and females. CAN is a CANcellation of a
previous contingency table SPECification. Both REQ and CAN are
part of the DML. It should be noted that these are not
exhaustive.
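How such statements might be processed can be sketched as below.
The statement grammar is an assumption inferred from the examples
(verb, function list, colon, data attribute, "by", category
attributes joined with "and"); the functions parse and run are
hypothetical and not part of any described system.

```python
import re

# Assumed shape: VERB f1, f2: DATA_ATTRIBUTE by CATEGORY [and CATEGORY ...]
STATEMENT = re.compile(r"^(SPEC|REQ|CAN)\s+(.+?):\s*(\w+)\s+by\s+(.+)$")


def parse(statement):
    """Split one statement into verb, functions and attributes."""
    match = STATEMENT.match(statement.strip())
    if match is None:
        raise ValueError(f"unrecognized statement: {statement!r}")
    verb, functions, attribute, categories = match.groups()
    return {
        "verb": verb,
        "functions": [f.strip() for f in functions.split(",")],
        "data_attribute": attribute,
        "category_attributes": [c.strip()
                                for c in re.split(r"\s+and\s+", categories)],
    }


def run(statements):
    """Track active specifications and which requests can be answered."""
    active, answered = set(), []
    for text in statements:
        s = parse(text)
        keys = {(f, s["data_attribute"], tuple(s["category_attributes"]))
                for f in s["functions"]}
        if s["verb"] == "SPEC":
            active |= keys
        elif s["verb"] == "CAN":
            active -= keys
        else:  # REQ: answerable only for previously specified functions
            answered.extend(sorted(k for k in keys if k in active))
    return active, answered


active, answered = run([
    "SPEC mean, stddev: SALARY by SEX",
    "SPEC mean: SALARY by SEX and RACE",
    "REQ mean: SALARY by SEX",
    "CAN mean: SALARY by SEX and RACE",
])
```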


5.3. Conceptual Level

With a knowledge of all specified functions, the DBMS can
easily determine the necessary parameters to realize them.
Since these parameters are updatable, they can be kept
current during data acquisition. It has been suggested [31] that
statistical databases are stable in the sense that after the
initial data entry and correction, there are few or no updates
to the database. While this is true for pure 'statistical data'
like census data, it is not necessarily true for general
commercial systems. There is no doubt that too many parameters
and frequent updates are bound to slow down data acquisition.
However, there are applications in which updates are
infrequent. Additionally, there are those in which updates come
in batches; market surveys, for instance [20]. In this case,
straightforward methods of calculating these parameters can
be applied to a new batch and these parameters added to
those of the original database when the new batch is merged.

Base classifications (sets of categories) are kept at this level
to exploit the additivity of the parameters. From the base
classifications, the DBMS can determine if newly specified
categories are derivable, thus eliminating redundancies arising
from user classifications. Redundancies arising from different
users can also be eliminated if a base classification is not
redundant. Unfortunately, deriving such a base classification
is a difficult problem. Also, determining whether a new category
is derivable from a given set of categories has been proved to
be NP-complete [24]. Thus, it seems that instead of a base
classification that is sound and complete with respect to user
classifications, a classification that is complete but not
necessarily sound (if easy to compute) will be desirable. It
should be noted that if user categories are disjoint, then the
derivation of the base classification and derivability become
polynomially computable.


A classification system CS, roughly, is conceptually a
collection of categories and their parameters. A formal model is
discussed elsewhere [24, 25]. In an enterprise, many CSs exist
at this level, and are chosen to meet the demands of the
enterprise. The category and data attribute sets in a CS will be
chosen according to the 'target functions', target functions
being those functions that share common parameters. Hence,
for example, one CS may suffice for all functions related to
counts such as frequency, contingency tables, etc. Another CS
defined on a data attribute set may exist for multivariate
analytic functions. The actual number of CSs at this level for
an enterprise will depend on the number of target functions and
the number of data attribute sets.

Processing at this level will include: the retrieval of
parameters; derivability tests [24] for categories; the addition
and deletion of categories. Retrieval of parameters will be in
response to user REQuests. A user SPECification always
implies a derivability test for each of the categories in the
SPECification. Categories that are not derivable are added to
the classification and their initial parameters calculated from
the database, possibly with the permission of the DBA. It
should be noted that the category attribute set of a user may
be a subset of the category attribute set at the conceptual
level. In this case, the unspecified attributes are aggregated
over [7, 32]. For example, if a user specifies the category
'hispanic', this will become 'hispanic&(female, male)' at the
conceptual level if the category attribute set is (RACE, SEX)
at this level. Categories are removed in response to user
CANcellations when necessary.
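Aggregating over unspecified category attributes amounts to
adding the parameters of all base cells that match the attributes
the user did specify. The sketch below assumes base cells keyed
by (RACE, SEX) and holding the additive parameters (count, sum);
the function name aggregate and the sample figures are
hypothetical.

```python
def aggregate(base, attributes, **fixed):
    """Add the (count, sum) parameters of every base cell matching `fixed`."""
    positions = {name: i for i, name in enumerate(attributes)}
    n, total = 0, 0.0
    for cell, (cell_n, cell_sum) in base.items():
        if all(cell[positions[a]] == v for a, v in fixed.items()):
            n += cell_n
            total += cell_sum
    return n, total


# Base classification on the category attribute set (RACE, SEX).
base = {
    ("hispanic", "female"): (3, 90.0),
    ("hispanic", "male"): (2, 50.0),
    ("white", "female"): (4, 100.0),
}

# The user names only RACE; SEX is aggregated over.
n, total = aggregate(base, ("RACE", "SEX"), RACE="hispanic")
mean_salary = total / n
```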


5.4. Physical Level

The physical level is concerned with the actual storage of the
base classifications and the parameters associated with the
categories. Efficient storage and retrieval structures will be
needed.


5.5. Interfaces and Mappings

The arrows in Fig. 1 show where the interfaces exist. Besides
the two regular interfaces between the three levels, there are
two horizontal interfaces. The DBMS has to provide the
mappings between the user/conceptual and conceptual/physical
interfaces of the statistical subsystem. The former will include
mappings from functions to parameters and from user
classifications to base classifications (i.e., the way they are
derived). The latter interface includes mappings to actual
physical storage. The horizontal mapping at the user level
exists because classifications are defined on user category
attributes and the functions on user data attributes that are
part of a user view. The same explanation holds for the
horizontal mapping at the conceptual level except that,
additionally, the conceptual statistical component may have to
make requests directly on the base tables; for instance, when
the parameters of an underivable category are to be initially
accumulated. No horizontal interface exists at the physical
level, implying the independence of the two subsystems at this
level. A consequence of this is that it is possible to answer
some statistical queries without the presence of the physical
database. This is not to say that the two physical subsystems
cannot reside in the same physical device; the independence is
logical.


6. Conclusion

A CS captures the essence of statistical queries. By
representing different user classifications by a base
classification, data sharing is possible. In addition,
redundant calculations arising from different function
calculations and different users have been reduced. Response
time for statistical queries will also be reduced since the
intermediate result (parameter) acquisition phase, which
involves database scans, has been eliminated. A CS may also be
used to model time, time being a category attribute.

The system can be partially implemented. For instance, a
system can be implemented without a sophisticated derivability
capability. This may be the case when it is known that all
categories are always disjoint; this was the case in an
implemented system. The updating capability on a per-record
basis may be substituted with a higher level merge procedure
that uses the direct methods for calculating the parameters, and
the additive formulas for the merge. It has also been
demonstrated that good algorithms exist, in particular when
category attribute domains are small or ordered.

A CS does not come without a cost. Additional rounding
errors are incurred when updating formulas are used in
calculating parameter values. However, better accuracy can be
achieved if the parameters are kept in twice the precision of
the final function values. As pointed out, downdating results
in even greater error [35]. Fortunately, many attribute
domains are of the same sign and, if high accuracy is desired,
separate parameters (but of the same type) can be accumulated
for deleted records of the database. Then, at function-compute
time, these parameters are 'subtracted' from the cumulative
ones before function calculation. Additional cost is also
incurred in the extra processing required for database
activities related to data acquisition (insertion, deletion and
modification) if parameters are to be updated. This suggests an
environment with infrequent or timed updates, as previously
mentioned.

Then there are those statistical functions that are expensive to
parameterize in the sense that the parameters are not easily
updatable. Some clever methods have to be used to update
these functions if deletions are a frequent occurrence, since a
deletion will most likely invalidate the parameter, resulting in
a scan of the database. Some statistical functions, such as
residuals in multivariate analyses, cannot be performed without
a database scan.


7. Future Direction

Better algorithms are needed for some of the problems related
to derivability. The question of what should constitute a base
classification has to be addressed. Minimally, this
classification has to be complete. Given such a classification,
polynomial time algorithms exist for addition and deletion of
categories. The general derivability problem was shown to be
NP-complete. More special cases need to be identified, cases
that relate to structures in statistical classification schemes.

Security in statistical databases has been a major concern to
database designers [8, 15, 36]. Since this model has a
collection of 'statistical abstracts', it may be worthwhile to
investigate how these abstracts can be used in the various
schemes for combating inferential statistical compromise. Or, it
may be possible to develop a completely new scheme given these
abstracts.

The CS has been discussed without any direct mention of any
database model. In the network model, a CS can translate into
a DBTG set [9, 26]. The owner record type is the record type
that contains the category attribute set and the member record
type is the one that contains the data attribute set. If the same
record type contains both the category and data attribute sets,
then a dummy record type has to be created since the owner
and member types are supposed to be distinct. In a relational
model, components of a CS correspond directly to one
relation and its instance.

In any database model, the classification system can be
expanded to include many table schemes (record types or
relations). Thus, a category attribute set may span more than
one table (so may the data attribute set). It is then necessary
to give meaning to this expanded CS in the context of the
particular database model, in particular with regard to the data
manipulation operators of the database model. In the case of
the relational model, determining the categories whose
parameters are to be updated, given that an insertion has
occurred in a relation containing part of the category attribute
set, is not an easy problem. This problem is directly related to
the general problem of updates in a relational system [13, 14].
Updates aside, the initial gathering of parameter values needs
proper descriptions of procedures that are not immediately
obvious.

Efficient implementation data structures are needed. In a
system that has already been implemented, an array
linearization scheme was used and, since the categories were
disjoint, derivation was easy to implement. However, a more
sophisticated data structure will be required for the more
general case of non-disjoint categories.


The usual multivariate estimation problems reduce to optimization problems
involving parameter matrices, which may be patterned due to constraints or for
reasons of identifiability. Matrix derivatives, where allowance is made for patterns
in the argument matrix, are suggested, and permuted identity matrices are extended to
cover partitioned matrices. Applications are made to parameter estimation of a
factor analysis model. Subroutines are studied for computer implementation.


1.  INTRODUCTION 

Often the estimation problems in multivariate
statistics are optimization problems
involving several parameters in the form of
matrices, which may be partitioned into blocks.
Sometimes the matrices involved are patterned
matrices, as in covariance structure models,
where conditions are imposed on the parameter
matrices for the model to be identifiable. We
propose a method to obtain matrix derivatives
of a matrix function of matrix arguments, where
the argument matrix may be patterned. We also
extend the notion of permuted identity matrices
to matrices with row blocks or column blocks,
in order to take care of partitioned matrices.

We apply some of these notions to the confirmatory
inter-battery factor analysis model,
where we obtain the generalized least squares
estimators of the parameter matrices. Some
subroutines are suggested for computer implementation
of the method. The method is applicable
to other problems where the objective
function is a function of matrices, which are
possibly patterned due to constraints among
the elements.


2.  MATRIX  DERIVATIVES 


Let $Y = f(X)$ be a matrix function of a
matrix argument $X$, where $X$ is $m \times n$ and $Y$
is $p \times q$. Then $\mathrm{vec}\,X$ denotes the column vector
of order $mn$ formed by stacking the
columns of $X$, one above the other, starting
with the first column. Matrix $Y$ is similarly
"vectorized" to $\mathrm{vec}\,Y$, and displayed as a row
vector $\mathrm{vec}'\,Y$. If a typical element of $\mathrm{vec}'\,Y$
is $y_{\alpha\beta}$ and that of $\mathrm{vec}\,X$ is $x_{ij}$, then the
collection $\partial y_{\alpha\beta}/\partial x_{ij}$ can be represented by a
$pq \times mn$ matrix $\partial\,\mathrm{vec}\,Y/\partial\,\mathrm{vec}\,X$. This becomes a
representation of the derivative of the function
with respect to the usual bases in the $X$
space and the $Y$ space, and is known as the
matrix derivative of $Y$ with respect to matrix
$X$.

When the elements of $X$ have equality or
other relationships between them, or some elements
are constants, the matrix is said to be
patterned. Examples are symmetric or skew-symmetric
matrices. This requires a modification
of the definition of a matrix derivative.
Here we take the $k$ independent and variable
elements of $X$ and define a one to one function
$J$ on $\mathbb{R}^k$ onto the set $D$ of all matrices
with this particular pattern. We take the
extension $\bar f(X)$ of the function $f(X)$ to the
whole space of all $m \times n$ matrices by ignoring
the pattern of $X$. For any $X$ in $D$, we have
$\bar f(X) = f(X)$, and a corresponding vector $x$ in
$\mathbb{R}^k$ such that $J(x) = X$. Now consider the composite
function $G(x) = \bar f \circ J(x)$. Since $J(x) = X$,
we have $G(x) = f(X)$. Thus we can define
the derivative of $G(x)$ by using the chain rule

$$G'(x) = \bar f'(J(x))\,(J'(x))$$

where $\bar f'(J(x))$ is the derivative of $\bar f$ at the
point $J(x)$. By taking the matrix representation
of $G'(x)$, we get

$$[G'(x)] = [\bar f'(J(x))]\,[J'(x)].$$

Here $[\bar f'(J(x))]$ is nothing but the matrix
derivative obtained by ignoring the pattern of
$X$. Thus the procedure amounts simply to postmultiplication
of the matrix derivative,
obtained upon ignoring the pattern, by another
matrix, which is related to the pattern of $X$.
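
As a small numerical check of the postmultiplication rule, the following sketch differentiates the determinant of a symmetric 2 x 2 matrix. It assumes the ordering vec X = (x11, x21, x12, x22)' and the free-parameter vector x = (x11, x21, x22)'; the function names are illustrative only.

```c
/* [J'(x)] for a symmetric 2x2 X with x = (x11, x21, x22)':
   row i marks which free element feeds position i of
   vec X = (x11, x21, x12, x22)'. */
static const int J_SYM2[4][3] = {
    {1, 0, 0},
    {0, 1, 0},
    {0, 1, 0},
    {0, 0, 1}
};

/* Derivative of |X| = x11*x22 - x12*x21 ignoring the pattern,
   stored as a 1x4 row vector in the order of vec X. */
void det_grad_unpatterned(double x11, double x21, double x12,
                          double x22, double g[4])
{
    g[0] =  x22;   /* d|X|/dx11 */
    g[1] = -x12;   /* d|X|/dx21 */
    g[2] = -x21;   /* d|X|/dx12 */
    g[3] =  x11;   /* d|X|/dx22 */
}

/* Derivative allowing for symmetry: postmultiply by [J'(x)]. */
void det_grad_symmetric(double x11, double x21, double x22,
                        double out[3])
{
    double g[4];
    int i, k;

    det_grad_unpatterned(x11, x21, x21, x22, g);
    for (k = 0; k < 3; k++) {
        out[k] = 0.0;
        for (i = 0; i < 4; i++)
            out[k] += g[i] * J_SYM2[i][k];
    }
}
```

The two off-diagonal contributions add, which is why the middle entry of the patterned derivative is -2*x21 rather than -x21.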


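For example, let $X = \begin{pmatrix} x_{11} & x_{21} \\ x_{21} & x_{22} \end{pmatrix}$ be symmetric,
and $Y = f(X) = |X| = x_{11}x_{22} - x_{21}^2$. Ignoring
the pattern of $X$, a well known result is

$$\frac{\partial\,\mathrm{vec}\,Y}{\partial\,\mathrm{vec}\,X} = |X|\,\mathrm{vec}'\,(X^{-1})' = (x_{22},\ -x_{21},\ -x_{21},\ x_{11}).$$

Letting $x = (x_{11}, x_{21}, x_{22})'$, define $J(x) = X$, i.e.

$$J\begin{pmatrix} x_{11} \\ x_{21} \\ x_{22} \end{pmatrix} = \begin{pmatrix} x_{11} & x_{21} \\ x_{21} & x_{22} \end{pmatrix},
\qquad
[J'(x)] = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

The matrix derivative allowing for the pattern is then
$(x_{22},\ -2x_{21},\ x_{11})$.


3. PERMUTED IDENTITY MATRICES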


These were introduced in the literature to
permute the rows of an identity matrix, and
were used to relate $\mathrm{vec}\,A$ and $\mathrm{vec}\,A'$. We
extend these with the purpose of relating the
vector of a partitioned matrix and the vectors
of the blocks.


Consider the partitions of positive integers
$m$, $n$ as $m = \sum_{i=1}^{r} m_i$ and $n = \sum_{j=1}^{s} n_j$. Then
the identity matrix of order $mn$ may be
written as

$$I_{mn} = \mathrm{block\ diag}(I_{m_1}, \ldots, I_{m_r},\ I_{m_1}, \ldots, I_{m_r},\ \ldots,\ I_{m_1}, \ldots, I_{m_r}).$$

We denote by $T^{mn}_{r}$ the matrix obtained by
rearranging the row blocks of the above matrix
by taking every $r$th block starting with the
first block, then every $r$th block starting with
the second block, and so on. Interchanging the
roles of $m$ and $n$ and taking every $s$th block,
we obtain the matrix $T^{nm}_{s}$.

Rearranging the row blocks by taking every $s$th
block starting with the first, we obtain $T^{s}_{mn}$.
Interchanging the roles of $m$ and $n$, we obtain
$T^{r}_{nm}$. Then $T^{r}_{nm}\,T^{s}_{mn} = I_{mn}$.
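Let $\Sigma$ be an $m \times n$ matrix, partitioned
into $rs$ blocks, $\Sigma = (\Sigma_{ij})$, $i = 1, \ldots, r$,
$j = 1, \ldots, s$, where $\Sigma_{ij}$ is $m_i \times n_j$,
$m = \sum_{i=1}^{r} m_i$, $n = \sum_{j=1}^{s} n_j$. We let Rvec
denote the formation of vectors of blocks in
the row order and Cvec in the column order.
Thus

$$\mathrm{Rvec}\,\Sigma = (\mathrm{vec}'\,\Sigma_{11}, \mathrm{vec}'\,\Sigma_{12}, \ldots, \mathrm{vec}'\,\Sigma_{1s}, \mathrm{vec}'\,\Sigma_{21}, \ldots, \mathrm{vec}'\,\Sigma_{rs})'$$

$$\mathrm{Cvec}\,\Sigma = (\mathrm{vec}'\,\Sigma_{11}, \mathrm{vec}'\,\Sigma_{21}, \ldots, \mathrm{vec}'\,\Sigma_{r1}, \mathrm{vec}'\,\Sigma_{12}, \ldots, \mathrm{vec}'\,\Sigma_{rs})'.$$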

We  then  find  relationships  like 

T”’''vec  Z  -  Rvec  Z 
r 

vec  Z  -  T”’''Rvec  Z 
n 

_s  _ran  r.  « 

T  T  vec  L  ■  Cvec  L 

mn  r 

or  T®  vec  Z  =  Cvec  Z  . 
mn,r 

Some  results  on  @  products  [S], 

A  ®  B  •  UAjj  ®  ] ,  where  ®  denotes 

Kronecker  product  of  blocks  and 

can  be  related.  If  A  Is  mxn  and  B  Is 
t  u 

pxq,  p  «  r  p.  ,  q  »  E  q.,  we  have 

k=l  “  J.-1  *• 

,(B®A)  -  (b®a)t‘' 
mp,r  ^  nq,s 

t‘  (bQa)!""’’®  -  B®A 
np,r  q 


Next  we  partition 

1  ^  block  diag(I 

ran  m 


I 

mn 


as 


n  n 


th 

Rearranging  the  row  blocks  by  taking  every  n^ 

block  starting  with  the  first,  we  get  T^”,  and 

n 

interchanging  the  roles  of  m  and  n,  we  obtain 

T^"^.  Then  we  have  the  relationships 
^mn  ^mn  ^  ^  ^nm  ^nm  _  ^ 

r  n  ran  *  s  ra  mn ' 


Now  consider  the  partition 

^mn  ■  n  ’••• 

mn  m, n^ 

I 


ffl.n 


I  .1 

m,n  m-n. 
Is  z  1 


n  )• 

m  n 


If  X,Y  are  random  vectors  with  m,n 
components  respectively  and  E(X)  •  y, 

E(Y)  -  V,  Cov(Y,X*)  -  Z,  then  E(X(JTJY)  - 

T*^  vec  Z  +  iJ®v  and  for  X,Y  independent 
nm ,8  ..v  ^2\ 

Cov(X,X*)  =  Z^  \  Cov(Y,Y*)  -  t''  \ 

E(XY'®XY')  -  IT^  vec  +y®y] 

^  mm , r  ^ 

IT®  vec  E^^^  +  v®v] 
nn,s  ^ 

Cov(X®Y,(X@Y)’)  .  (E^^>  +  wm')  ®  (E^^^  + 
vv* )  -  PM '  ®  W  . 

4. A FACTOR ANALYSIS MODEL

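We consider the application of the above
ideas in the problem of estimation of parameters
for the confirmatory interbattery
factor analysis model [3]. Here one has two
sets of scores $x_1$ and $x_2$ of two batteries
of tests which have a common factor $z$. Let
$y_1$ and $y_2$ denote the factors specific to
batteries 1 and 2, and $e_1$, $e_2$ the error
terms. The model is formulated as

$$x_1 = \mu_1 + \Lambda_1 z + \Gamma_1 y_1 + e_1$$
$$x_2 = \mu_2 + \Lambda_2 z + \Gamma_2 y_2 + e_2$$

where $\mu_1$, $\mu_2$ are the battery means and
$\Lambda_1$, $\Lambda_2$, $\Gamma_1$, $\Gamma_2$ are the corresponding factor
loadings. Further, with $x = (x_1', x_2')'$, $y = (y_1', y_2')'$,
$e = (e_1', e_2')'$, it is assumed that $z \sim N(0, \Phi)$,
$y_\ell \sim N(0, \Theta_\ell)$, $e_\ell \sim N(0, \Psi_\ell)$, $\ell = 1, 2$, and
$\mathrm{Cov}(y, z') = 0$, $\mathrm{Cov}(z, e') = 0$, $\mathrm{Cov}(y, e') = 0$,
$\mathrm{Cov}(y_1, y_2') = 0$, $\mathrm{Cov}(e_1, e_2') = 0$. Also, $\Psi_1$ and
$\Psi_2$ are assumed to be diagonal matrices and $\Phi$,
$\Theta_1$ and $\Theta_2$ are symmetric matrices. The
variance covariance matrix $\mathrm{Cov}(x, x')$ of the
model becomes

$$\Sigma = \begin{pmatrix}
\Lambda_1 \Phi \Lambda_1' + \Gamma_1 \Theta_1 \Gamma_1' + \Psi_1 & \Lambda_1 \Phi \Lambda_2' \\
\Lambda_2 \Phi \Lambda_1' & \Lambda_2 \Phi \Lambda_2' + \Gamma_2 \Theta_2 \Gamma_2' + \Psi_2
\end{pmatrix}.$$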
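In order that the model be identifiable,
one has to specify some of the elements in the
matrices involved. If $\Phi$ is $k \times k$ and $\Theta_\ell$ is
$k_\ell \times k_\ell$, $\ell = 1, 2$, we have to impose $k^2$
conditions on $\Lambda_1$, $\Lambda_2$ and $\Phi$, $k_1^2$ conditions on
$\Gamma_1$ and $\Theta_1$, and $k_2^2$ conditions on $\Gamma_2$ and $\Theta_2$.
In confirmatory factor analysis, one may wish to
impose conditions on factor loadings, instead of
on the covariance matrices of the factors.
Thus, besides the symmetry of $\Phi$, $\Theta_1$ and $\Theta_2$,
no other patterns are imposed. However,
$\Lambda_1$, $\Lambda_2$, $\Gamma_1$ and $\Gamma_2$ become patterned matrices.
One can use suitable identity matrices as submatrices
of the above matrices, so that the
required number of specifications is met. For
example, we can have $I_k$ as a submatrix of $\Lambda_1$
and $\Lambda_2$, $I_{k_1}$ as a submatrix of $\Gamma_1$ and $I_{k_2}$
as a submatrix of $\Gamma_2$.

For the estimation of the unspecified parameters
in the model, either the maximum likelihood
or the generalized least squares method is
generally used. The objective function, which
is a function of $\Sigma$, to be minimized in the case
of the generalized least squares method, is

$$f(\Sigma) = \tfrac{1}{2}\,\mathrm{tr}\,[(S - \Sigma)S^{-1}]^2$$

where $S$ is the sample variance covariance
matrix. We can write this as

$$f(\Sigma) = \tfrac{1}{2}\,\mathrm{vec}'(S - \Sigma)\,(S^{-1} \otimes S^{-1})\,\mathrm{vec}(S - \Sigma).$$

Since $S^{-1}$ is positive definite, so also is
$S^{-1} \otimes S^{-1}$. Denoting its symmetric square root
by $U^{1/2}$, we have

$$f(\Sigma) = \tfrac{1}{2}\,\mathrm{vec}'(S - \Sigma)\,U^{1/2}\,U^{1/2}\,\mathrm{vec}(S - \Sigma) = \tfrac{1}{2}\,[h(\Sigma)]'\,h(\Sigma)$$

where $h(\Sigma) = U^{1/2}\,\mathrm{vec}(S - \Sigma)$.

Letting $\theta$ denote the vector of all the
unspecified distinct parameters involved in the
model, the gradient of $f(\Sigma)$ is

$$\frac{\partial f(\Sigma)}{\partial \theta} = [JA(\Sigma)]'\,h(\Sigma)$$

where

$$JA(\Sigma) = \frac{\partial h(\Sigma)}{\partial \theta} = \frac{\partial}{\partial \theta}\,U^{1/2}\,\mathrm{vec}(S - \Sigma) = -\,U^{1/2}\,\frac{\partial\,\mathrm{vec}\,\Sigma}{\partial \theta},$$

so that

$$\frac{\partial f(\Sigma)}{\partial \theta} = -\left(\frac{\partial\,\mathrm{vec}\,\Sigma}{\partial \theta}\right)' (S^{-1} \otimes S^{-1})\,\mathrm{vec}(S - \Sigma).$$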

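Thus, in actual calculation, one does not have
to obtain $U^{1/2}$ explicitly. While considering
$\partial\,\mathrm{vec}\,\Sigma/\partial \theta$, we find that because of the pattern of
$\Sigma$, finding the derivatives of its submatrices
is much simpler. Let $\mathrm{ve}\,A$ denote the vector
of unspecified parameters involved in a submatrix
$A$ of $\Sigma$. Clearly $\partial\,\mathrm{vec}\,\Sigma_{ij}/\partial\,\mathrm{ve}\,A$ is easier
to calculate than $\partial\,\mathrm{vec}\,\Sigma/\partial\,\mathrm{ve}\,A$, i.e., it is easier
to find $\partial\,\mathrm{Rvec}\,\Sigma/\partial \theta$ or $\partial\,\mathrm{Cvec}\,\Sigma/\partial \theta$. Using permuted
identity matrices,

$$\frac{\partial\,\mathrm{vec}\,\Sigma}{\partial \theta} = T\,\frac{\partial\,\mathrm{Rvec}\,\Sigma}{\partial \theta}$$

where $T$ is the permuted identity matrix of order $p^2$
relating $\mathrm{vec}\,\Sigma$ and $\mathrm{Rvec}\,\Sigma$, $p = p_1 + p_2$, $p_1$, $p_2$ being
the number of tests in the first and the second
battery. The gradient of the objective
function is

$$\frac{\partial f(\Sigma)}{\partial \theta} = -\left(\frac{\partial\,\mathrm{Rvec}\,\Sigma}{\partial \theta}\right)' T'\,(S^{-1} \otimes S^{-1})\,\mathrm{vec}(S - \Sigma).$$

Clearly, closed form estimates are not
available for this problem, and one has to use
a computer approach. Correspondingly,

$$JA(\Sigma) = -\,U^{1/2}\,\frac{\partial\,\mathrm{vec}\,\Sigma}{\partial \theta} = -\,U^{1/2}\,T\,\frac{\partial\,\mathrm{Rvec}\,\Sigma}{\partial \theta}.$$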


5. COMPUTER IMPLEMENTATION

In most of the optimization routines in
practice, the user has to provide the function
and the gradient of the function. In the above
example, it is easy to program the matrices $T$
and $\partial\,\mathrm{Rvec}\,\Sigma/\partial \theta$. The latter is a partitioned
matrix with 36 blocks, only 16 of which are
non-zero blocks. Thus, corresponding to

the  non-zero  blocks  are 


(I  +  I  „  XA.tlx)!  )J 

‘’i  ®i 


and so on. The J matrices are the
matrices corresponding to the patterns of the
parameter matrices indicated.

The implementation of matrix derivatives
in a computer program amounts to finding the
products of matrices as above. This requires
writing subroutines to find the products of
matrices, whereas element by element differentiation
would have become very cumbersome.
Gauss-Newton methods are popular in optimization
problems of factor analysis [1],[4].

Usual optimization routines consider general
problems with constraints among the
variables. In this problem, we have taken care
of the constraints in the patterns of the respective
matrices. Thus the routine employed
should allow the possibility of zero constraints.

We consider the case of 4 tests in the
first battery and 5 in the second, with 2 factors
common to the two batteries and 2 specific
factors for each battery. Thus $\Sigma$ is $9 \times 9$,
$\Lambda_1$ is $4 \times 2$, $\Lambda_2$ is $5 \times 2$, $\Gamma_1$ is $4 \times 2$, $\Gamma_2$ is
$5 \times 2$, $\Psi_1$ is $4 \times 4$, $\Psi_2$ is $5 \times 5$, and $\Phi$, $\Theta_1$ and
$\Theta_2$ are each $2 \times 2$. The restrictions require the
first two elements of the first column of $\Lambda_1$
to be 1 and 0, and the first two elements of
the second column of $\Lambda_1$ to be 0 and 1 respectively.
The upper left $2 \times 2$ submatrices of
$\Gamma_1$ and $\Gamma_2$ are $I_2$. Although there are 89
variables in all, only 42 of them are to be
dealt with due to the existence of patterns.
There are 3 symmetric and 2 diagonal matrices
involved. We have to find 9 J matrices and one
T matrix. On the following page, we supply the
subroutines to find these.
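
For readers more at home in C, the J matrix of a symmetric pattern can be sketched as below. This is only an illustration in the spirit of the JSYMET subroutine, assuming zero-based indices, a row-major int array supplied by the caller, and free elements ordered column by column starting at the diagonal; the function name jsym is hypothetical.

```c
/* Fill J, a (k*k) x (k*(k+1)/2) zero-one matrix stored row-major,
   so that J times the vector of free elements of a symmetric
   k x k matrix X gives vec X (columns of X stacked).
   The caller supplies J with k*k*k*(k+1)/2 entries. */
void jsym(int k, int *J)
{
    int n = k * (k + 1) / 2;   /* number of free elements */
    int col = 0;               /* running free-element index */
    int i, j, t;

    for (t = 0; t < k * k * n; t++)
        J[t] = 0;

    for (j = 0; j < k; j++) {          /* column of X */
        for (i = j; i < k; i++) {      /* on or below the diagonal */
            J[(j * k + i) * n + col] = 1;   /* element (i,j) of X */
            J[(i * k + j) * n + col] = 1;   /* its mirror (j,i)  */
            col++;
        }
    }
}
```

For k = 2 this reproduces the 4 x 3 pattern matrix of a symmetric 2 x 2 matrix; each row of J contains exactly one 1, and each off-diagonal free element appears in two rows.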


In order to implement the Gauss-Newton
algorithm for this problem, we need to calculate
$JA(\Sigma)$ and then

$$[JA(\Sigma)]'\,JA(\Sigma) = \left(\frac{\partial\,\mathrm{Rvec}\,\Sigma}{\partial \theta}\right)' T'\,(S^{-1} \otimes S^{-1})\,T\,\frac{\partial\,\mathrm{Rvec}\,\Sigma}{\partial \theta}.$$

In our problem $\partial\,\mathrm{Rvec}\,\Sigma/\partial \theta$ is an $81 \times 42$ matrix
with 36 blocks, out of which only 16 are non-zero
blocks. At each step of iteration,

$$\Delta = -\,\{[JA(\Sigma)]'\,JA(\Sigma)\}^{-1}\,\frac{\partial f(\Sigma)}{\partial \theta}$$

is to be calculated and $\theta$ is updated by
$\theta + \Delta$ until $|\partial f(\Sigma)/\partial \theta|$ becomes sufficiently
small. This method was not possible because
$[JA(\Sigma)]'\,JA(\Sigma)$ became singular at certain
stages.

Thus the Marquardt-Levenberg algorithm was
used, where the updating procedure at the $k$th
step is to calculate

$$\Delta = -\,\{[JA(\Sigma)]'\,JA(\Sigma) + \mu_k I\}^{-1}\,\frac{\partial f(\Sigma)}{\partial \theta}$$

where $\mu_k$ is a bounded sequence of positive
integers. The convergence became very slow.

At the last step we used the subroutine
ACDPAC [2], providing the subroutines to calculate
$f(\Sigma)$ and $\partial f(\Sigma)/\partial \theta$. We had to make
slight modifications in the subroutine to
handle the case of zero constraints. We find
an almost perfect fit.
      SUBROUTINE JSYMET(K,M,N,JO)
C
C     THIS SUBROUTINE FINDS THE
C     J MATRIX CORRESPONDING TO
C     ANY SYMMETRIC PATTERN MATRIX.
C
C     K=THE ORDER OF THE MATRIX
C     M=K**2
C     N=K*(K+1)/2
C
      INTEGER JO(M,N)
      DO 30 I=1,M
      DO 30 J=1,N
      JO(I,J)=0
   30 CONTINUE
      L=0
      K1=K-1
      DO 100 I=1,K1
      L=L+1
C     DIAGONAL ELEMENT (I,I)
      JO((I-1)*(K+1)+1,L)=1
      KI=K-I
      DO 100 J=1,KI
      L=L+1
C     ELEMENT (I+J,I) AND ITS MIRROR (I,I+J)
      JO((I-1)*(K+1)+J+1,L)=1
      JO((I-1)*(K+1)+K*J+1,L)=1
  100 CONTINUE
      JO(M,N)=1
      RETURN
      END


      SUBROUTINE JTOU(IP,L,LPML,IPL,JTW)
C     THIS SUBROUTINE FINDS THE
C     J MATRIX CORRESPONDING TO
C     THE MATRICES OF THE FORM (I,B)'
C     WHERE I IS THE UNIT MATRIX.
C
C     THE GIVEN MATRIX IS OF ORDER (IP,L)
C     THE UNIT MATRIX IS OF ORDER L.
C     LPML=L*(IP-L)
C     IPL=IP*L
C
      INTEGER JTW(IPL,LPML)
      DO 15 I=1,IPL
      DO 15 J=1,LPML
      JTW(I,J)=0
   15 CONTINUE
      IPML=IP-L
      DO 25 K=1,L
      DO 25 J=1,IPML
      JTW((K-1)*IP+L+J,(K-1)*IPML+J)=1
   25 CONTINUE
      RETURN
      END


      SUBROUTINE JDIAG(IP,IPP,JDIG)
C     THIS SUBROUTINE FINDS THE
C     J MATRIX CORRESPONDING TO
C     A DIAGONAL MATRIX.
C
C     IP=ORDER OF THE GIVEN MATRIX
C     IPP=IP**2
C
      INTEGER JDIG(IPP,IP)
      DO 10 I=1,IPP
      DO 10 J=1,IP
      JDIG(I,J)=0
   10 CONTINUE
      DO 20 I=1,IP
C     DIAGONAL ELEMENT (I,I) IS THE I-TH FREE PARAMETER
      JDIG((I-1)*(IP+1)+1,I)=1
   20 CONTINUE
      RETURN
      END

      SUBROUTINE TMATRX(M,N,MV,IR,MN,ID,TMN)
C
C     THIS SUBROUTINE FINDS THE
C     T(MN,R) MATRIX.
C     M AND N ARE GIVEN NUMBERS
C     PARTITIONED AS
C     M=M(1)+M(2)+...+M(IR) AND
C     N=N(1)+N(2)+...+N(IS)
C     MV IS THE VECTOR OF DIMENSION
C     'IR' CONTAINING THE
C     ELEMENTS OF THE PARTITION OF M.
C     MN=M*N
C     ID IS WORK SPACE FOR THE IDENTITY MATRIX.
C     TMN IS ASSUMED HERE TO BE RETURNED THROUGH
C     THE ARGUMENT LIST.
C
      INTEGER TMN(MN,MN),MV(IR)
      INTEGER ID(MN,MN)
      DO 10 I=1,MN
      DO 20 J=1,MN
      ID(I,J)=0
   20 CONTINUE
      ID(I,I)=1
   10 CONTINUE
      DO 30 L=1,MN
      DO 30 I=1,N
      MM=0
      DO 30 K=1,IR
      DO 40 J=1,MV(K)
      TMN(M*(I-1)+MM+J,L)=ID(MM*N+
     1   (I-1)*MV(K)+J,L)
   40 CONTINUE
      MM=MM+MV(K)
   30 CONTINUE
      RETURN
      END
A method for obtaining computer generated analytic first and second derivatives is
presented. These derivatives are used in the fitting of nonlinear models defined by
systems of linear differential equations. Second derivative information offers the
possibilities of improved convergence and calculation of curvature measures of
nonlinearity. The method is illustrated with a C program.


Compartmental models are an important class of
mathematical models. They are used in many
fields, but a pharmaceutical example will illustrate
their characteristics. Suppose D amount
of drug is introduced into the blood stream.
The drug travels to its site of action, returns
to the blood stream and is eliminated. Pictorially
this can be shown as follows

[Figure: two-compartment diagram]

where $X_1(t)$ represents the amount of drug in the
blood (Compartment 1) at time t and $X_2(t)$ represents
the amount of drug at the site of action
(Compartment 2) at time t. The K's are called
the rate constants. Letting $X_1 = X_1(t)$ and
$X_2 = X_2(t)$, we can write

$$\frac{dX_1}{dt} = -(K_1 + K_3)X_1 + K_2 X_2$$

$$\frac{dX_2}{dt} = K_1 X_1 - K_2 X_2$$

or more succinctly,

$$\dot X = AX \qquad (1)$$

where

$$A = \begin{pmatrix} -(K_1 + K_3) & K_2 \\ K_1 & -K_2 \end{pmatrix},
\qquad
X = \begin{pmatrix} X_1(t) \\ X_2(t) \end{pmatrix}.$$

Loosely speaking, this says that in the interval
$(t, t + \Delta t)$, $\Delta t$ small, $X_1$ decreases $(K_1 + K_3)X_1(t)\Delta t$
and increases $K_2 X_2(t)\Delta t$. Similarly,
in the same interval, $X_2$ increases $K_1 X_1(t)\Delta t$
and decreases $K_2 X_2(t)\Delta t$. The concentration of
the drug in Compartment 1 is observed by taking
blood samples at time t. Statistically we
assume the observed concentration has the following
structure

$$y_t = X_1(t)/V + e_t$$

where V is the volume of distribution and $e_t$ is
a random variable, usually assumed to be $N(0, \sigma)$.
The more general model of Carroll and Ruppert
(1984) can also be applied with the obvious
changes. Since $X_1(t)$ is a nonlinear function of
t and $K = (V, K_1, K_2, K_3)$, a nonlinear least
squares method is used to estimate K. To do
this, $X_1$ and its derivatives with respect to K
are required. $X_1$ in turn requires solving (1).
Both finding $X_1$ and differentiating $X_1$ with
respect to K are tedious and subject to a high
probability of error. Various approximate
methods for solving for $X_1$ and estimating $\partial X/\partial K$
exist. We will present a method which solves
(1), finds $\partial X/\partial K$ and, in fact, finds partial
derivatives of any order. This method was inspired
by Jennrich and Bright (1976). This paper
presents a somewhat different approach that does
not require distinct eigenvalues and gives
second derivatives as well as first derivatives.
We will also describe some applications of
second derivatives and give some details of the
method of implementation in the C programming
language.
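
The loose reading of system (1) as small gains and losses over an interval can be turned into a crude forward-Euler sketch. This is one of the approximate approaches alluded to above, not the analytic method of this paper; the struct, the function names, and the reading of the third rate constant as the elimination rate are assumptions made for illustration.

```c
typedef struct {
    double k1;   /* blood -> site of action            */
    double k2;   /* site of action -> blood            */
    double k3;   /* elimination from blood (assumed)   */
} Rates;

/* Advance the amounts x[0] (blood) and x[1] (site) by one step dt. */
void euler_step(const Rates *r, double dt, double x[2])
{
    double out1 = (r->k1 + r->k3) * x[0] * dt;   /* leaves compartment 1 */
    double in1  = r->k2 * x[1] * dt;             /* enters compartment 1 */
    double in2  = r->k1 * x[0] * dt;             /* enters compartment 2 */
    double out2 = r->k2 * x[1] * dt;             /* leaves compartment 2 */

    x[0] += in1 - out1;
    x[1] += in2 - out2;
}

/* Start with the whole dose in the blood and take nsteps steps. */
void simulate(const Rates *r, double dose, double dt, int nsteps,
              double x[2])
{
    int i;

    x[0] = dose;
    x[1] = 0.0;
    for (i = 0; i < nsteps; i++)
        euler_step(r, dt, x);
}
```

With the elimination rate set to zero the total amount of drug stays constant, which is a convenient sanity check on the signs in (1).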

We must solve

$$\dot X = AX, \qquad A \text{ is } c \times c$$

for X. We will assume the characteristic roots
of A are all real but they need not be distinct.
Herron (1963) shows that this condition will
hold for all small practical problems. A
numerically stable method for solving (1) is to
first find the real Schur decomposition of A,

$$A = Q'TQ$$

where T is an upper triangular matrix and Q is
an orthogonal matrix. The diagonal elements of
T will be the eigenvalues of A. Then,

$$Q\dot X = QAX = QAQ'QX = Tu$$

where $u = QX$.

The $i$th element of u satisfies the equation

$$\dot u_i = \sum_{j=i}^{c} t_{ij}\,u_j \qquad (i = 1, \ldots, c).$$

For $i = c$ this is easily solved for $u_c$, and likewise the
rest of (1) can be solved by substituting the
solutions for $u_j$ that have already been solved
and integrating the resulting equation.

It may arise that two or more eigenvalues will
be equal. In that case a power of t will enter
in the solution. The general solution can be
represented as

$$\sum_{d=0}^{\mathrm{maxdeg}} t^d B^d e(\lambda t)$$

where $e(\lambda t)' = (e(\lambda_1 t), e(\lambda_2 t), \ldots, e(\lambda_c t))$.

For each power of t there is a $B^d$, $d = 0, \ldots, \mathrm{maxdeg}$.
The set of $B^d$ can be considered as a three dimensional
matrix and will be referred to as B, the
solution matrix for X.

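The partial derivatives of X with respect to K
are actually not hard to get. Taking the partial
derivative of the system with respect to an element
of K, say $K_i$,

$$\frac{\partial}{\partial K_i}\frac{\partial X}{\partial t} = A^i X + A\,\frac{\partial X}{\partial K_i} \qquad (2)$$

where $A^i = \partial A/\partial K_i$; $A^i_{jk}$ is $\pm 1$ if $A_{jk}$ has $\pm K_i$ in it.
If the order of differentiation on the left hand side of (2) is
changed

$$\frac{\partial}{\partial t}\frac{\partial X}{\partial K_i} = A^i X + A\,\frac{\partial X}{\partial K_i}. \qquad (3)$$

This is a system of differential equations to be
solved for $\partial X/\partial K_i$. Since we already know X, we
can rewrite (3) as

$$\frac{\partial}{\partial t}\frac{\partial X}{\partial K_i} = A^i \sum_d t^d B^d e(\lambda t) + A\,\frac{\partial X}{\partial K_i}.$$

Premultiplying the above by Q

$$Q\,\frac{\partial}{\partial t}\frac{\partial X}{\partial K_i}
= QA^i \sum_d t^d B^d e(\lambda t) + QA\,\frac{\partial X}{\partial K_i}
= QA^i \sum_d t^d B^d e(\lambda t) + QAQ'Q\,\frac{\partial X}{\partial K_i}
= QA^i \sum_d t^d B^d e(\lambda t) + T\,Q\,\frac{\partial X}{\partial K_i}.$$

Allen (1981) showed that the solution of this
system for $\partial X/\partial K_i$ has the form

$$\sum_{d=0}^{\mathrm{maxdeg}} t^d C_i^d e(\lambda t).$$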
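Second derivatives are found analogously. To
find $\partial^2 X/\partial K_i \partial K_j$ differentiate (3) with respect
to $K_j$

$$\frac{\partial}{\partial t}\frac{\partial^2 X}{\partial K_i \partial K_j}
= A^i\,\frac{\partial X}{\partial K_j} + A^j\,\frac{\partial X}{\partial K_i} + A\,\frac{\partial^2 X}{\partial K_i \partial K_j}$$

$$= A^i \sum_d t^d C_j^d e(\lambda t) + A^j \sum_d t^d C_i^d e(\lambda t) + A\,\frac{\partial^2 X}{\partial K_i \partial K_j}$$

$$= \sum_d t^d (A^i C_j^d + A^j C_i^d)\,e(\lambda t) + A\,\frac{\partial^2 X}{\partial K_i \partial K_j}.$$

Multiplying by Q,

$$Q\,\frac{\partial}{\partial t}\frac{\partial^2 X}{\partial K_i \partial K_j}
= \sum_d t^d\,Q(A^i C_j^d + A^j C_i^d)\,e(\lambda t) + T\,Q\,\frac{\partial^2 X}{\partial K_i \partial K_j}.$$

Allen's result holds and the solution is of the
form

$$\sum_d t^d D_{ij}^d e(\lambda t).$$

Clearly higher derivatives are possible but the
expressions are correspondingly more complicated.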

The determination of when the eigenvalues are
equal is potentially the hardest part of making
the method work. We can tell when two roots
are within machine epsilon of one another, but
this may not be the criterion we want. For
statistical purposes, we may deem two eigenvalues
equal long before a numerical analyst
would. The analogy in the linear regression
case is that the X matrix may be nonsingular
with respect to machine precision but the confidence
intervals for the parameters are so
large as to be useless.

We haven't determined a satisfactory criterion
yet. In the C subroutine we present, we sidestep
this by introducing a function cmplam.
This function returns 1 if the eigenvalues are
determined to be equal by whatever criterion we are
currently using and 0 if not equal.
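
One possible shape for such a comparison is sketched below. The tolerance value, the relative scaling, and the name cmplam_sketch are all assumptions made for illustration; the text above leaves the actual criterion open, and only the argument order (array of eigenvalues, then two indices) follows the call in the subroutine.

```c
/* Hypothetical statistical tolerance; much looser than machine epsilon. */
#define LAM_TOL 1e-6

static double absval(double v)
{
    return v < 0.0 ? -v : v;
}

/* Return 1 if lam[k] and lam[p] are deemed equal, 0 otherwise.
   The difference is compared with LAM_TOL times the size of the
   two roots, with a floor of 1 so that roots near zero are
   compared on an absolute scale. */
int cmplam_sketch(const double lam[], int k, int p)
{
    double diff  = absval(lam[k] - lam[p]);
    double scale = absval(lam[k]) + absval(lam[p]);

    if (scale < 1.0)
        scale = 1.0;
    return diff <= LAM_TOL * scale;
}
```

Keeping the criterion behind one function means the rest of the solver does not change when the tolerance is revised.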

It should be noted that the eigenvectors of A
are not found. In the presence of equal or near
equal eigenvalues, this can be an unstable
calculation. Also, the Schur decomposition of
A is only done once per iteration to calculate
all orders of partial derivatives.

APPLICATIONS 

Second  derivative  information  is  useful  for  many 
problems.  We  will  briefly  describe  two  of  them. 

The classical Gauss method of function minimization
requires the second derivative matrix of
the objective function. The objective function
in our case is the residual sum of squares.
Its second derivative matrix is

$$\dot r'\dot r + r'[\ddot r]$$

where $r = y - X$, $\dot r = \partial r/\partial K$ is $n \times p$, and
$\ddot r = \partial^2 r/\partial K\,\partial K'$ is $n \times p \times p$.
The bracket notation is adopted from Bates and
Watts [1980] where it means to sum over the sample
space index.


We may not want to perform the Gauss step at
every iteration but perhaps every pth step as a
restart for the Gauss Newton. It could also be
invoked on the basis of convergence criteria
or line search failure.


Another use of second derivative information is
the nonlinearity measures of Bates and Watts.
The coordinates of the second derivative matrix
are calculated relative to an orthogonal basis
of the tangent plane. These coordinates are
used to construct the nonlinearity measures and
are useful in themselves in understanding the
nature of the nonlinearity in the problem. To
calculate these coordinates, the second derivative
only has to be formed after the Gauss Newton
method has converged.


IMPLEMENTATION 


The preceding method is only a part of a larger
program for estimating K. We need some way of
keeping track of the solution matrices. For a
problem with c compartments and p rate constants,
1 solution matrix is required for the system,
p for the partial derivatives and p(p+1)/2 for
the second derivatives. One way to keep track
of these matrices is to keep track of their
addresses. In the programming language C we can
define an array of pointers. For example, for
second order partials we can define

double * D[i][j].


This declares D to be a two dimensional array of
pointers to double. A double is a double precision
real in C. We could have defined D to be
an array of pointers to char or int or anything
since pointers almost always have the same length
even if the things they point to don't. The
address of the (i,j) solution matrix is assigned
to D[i][j]. Referencing the (i,j) element of D
is the same as referencing the (i,j) solution.


We could predefine all the matrices we might need at
compile time and then assign those addresses to
the pointer arrays. Or we can allocate matrices
as we need them. There are C library functions
for demand allocation of storage, but they only
return a block of memory, not any specific type
such as an array. One must then make this
memory area look like a known type to the compiler.
For arrays this requires setting up dope
vectors containing the right information to simulate
a compiler generated array.
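
The bookkeeping just described might be sketched as follows. The bound MAXP, the flat storage layout, and every function name are assumptions made for illustration; the index arithmetic in sol_elem plays the part of the dope vector, and the standard library function calloc supplies the raw block of memory.

```c
#include <stdlib.h>

#define MAXP 8   /* hypothetical upper bound on the number of rate constants */

/* D[i][j] holds the address of the (i,j) second-derivative solution. */
double *D[MAXP][MAXP];

/* Number of solution matrices: system, first and second partials. */
int n_solution_matrices(int p)
{
    return 1 + p + p * (p + 1) / 2;
}

/* One solution "matrix" is (maxdeg+1) x nc x nc doubles in one block. */
double *alloc_solution(int maxdeg, int nc)
{
    return calloc((size_t)(maxdeg + 1) * nc * nc, sizeof(double));
}

/* Address of element (d, row, col) inside a solution block. */
double *sol_elem(double *s, int nc, int d, int row, int col)
{
    return &s[((size_t)d * nc + row) * nc + col];
}

/* Allocate the p(p+1)/2 distinct second-order solutions; D[i][j]
   and D[j][i] share one block. Returns 0 on success, -1 on failure. */
int alloc_second_order(int p, int maxdeg, int nc)
{
    int i, j;

    for (i = 0; i < p; i++) {
        for (j = 0; j <= i; j++) {
            double *s = alloc_solution(maxdeg, nc);
            if (s == NULL)
                return -1;
            D[i][j] = D[j][i] = s;
        }
    }
    return 0;
}

/* Release every block once and clear both pointers to it. */
void free_second_order(int p)
{
    int i, j;

    for (i = 0; i < p; i++) {
        for (j = 0; j <= i; j++) {
            free(D[i][j]);
            D[i][j] = D[j][i] = NULL;
        }
    }
}
```

Storing one block for each unordered pair (i,j) reflects the symmetry of second partial derivatives and matches the count p(p+1)/2 given earlier.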


For a problem with 3 compartments and 5 parameters,
the total storage for all the solutions
would be around 20K bytes assuming 8 byte reals.
A few years ago this might be one half to one
third of all the storage available to a typical
microcomputer. For today's and future microcomputers,
this is really an insignificant
amount of storage so we don't try to manage it
as efficiently as we could.


C is a fast and flexible language. The lack of
type checking has good and bad effects. The
ability to manipulate pointers allows great
freedom but can also produce obscure bugs. However,
assembly language programmers have lived
without benefits of any protections since the
beginnings of computers.


The  programs  we  have  written  are  also  portable. 
By  being  compatible  with  UNIX*,  C  compilers  are 
also  compatible  with  each  other.  We  have 
transferred  C  code  written  for  a  Motorola  68000 
processor  to  a  VAX  running  4.2BSD  UNIX  and  have 
compiled  and  executed  without  change. 


To improve the efficiency of a computation intensive
program like this one, significant
portions should be written in assembly language.
Languages  with  claimed  small  overhead  over 
assembly  language  achieve  that  small  overhead 
only  when  the  assembly  language  programmer  must 
play  by  the  same  rules  as  the  compiler.  Given 
complete  freedom,  the  assembly  written  program 
will  usually  be  much  faster  than  that  written 
in  a  higher  level  language.  Unfortunately,  the 
code  is  not  as  portable. 


The  following  subroutine  solves  the  system  of 
differential  equations.  If  nullb  is  1  then 
system  (1)  will  be  solved;  else  system  (3)  will 
be  solved. 


*UNIX is a trademark of AT&T.


solvesys (a, b, c, x0, lam, nc, deg, nullb)

REAL a[MAXNC][MAXNC],
     b[MAXDEG][MAXNC][MAXNC],
     c[MAXDEG][MAXNC][MAXNC];
REAL x0[MAXNC], lam[MAXNC];
int nullb, nc, *deg;
{
    int i, p, d, r, k, g, tmd;
    REAL temp, fact, powk, lamdif;

    /* work upward from the last row of the triangular system */
    for (p=nc-1; p>=0; p--)
    {
        if ((p==nc-1) && (nullb))
        {
            c[0][p][p] = x0[p];
            continue;
        }
        tmd = *deg;
        for (k=0; k<nc; k++)
        {
            for (d=0; d<=tmd; d++)
            {
                /* coefficient of t**d e(lam[k] t) in the forcing term */
                temp = 0.0;
                for (g=p+1; g<nc; g++)
                    temp += a[p][g]*c[d][g][k];
                if (!nullb) temp += b[d][p][k];
                if (temp==0.0) continue;
                lamdif = lam[k] - lam[p];
                if (cmplam(lam, k, p))
                {
                    /* equal roots: the power of t rises by one */
                    c[d+1][p][k] += temp/(d+1);
                }
                else
                {
                    /* distinct roots: integrate by parts */
                    fact = 1;
                    powk = lamdif;
                    for (r=0; r<=d; r++)
                    {
                        if (r==0) c[d-r][p][k] += temp/powk;
                        else
                        {
                            powk *= lamdif;
                            fact *= -(d-r+1);
                            c[d-r][p][k] += (fact*temp) / powk;
                        }
                    }
                }
            }
        }
        /* match the initial condition for row p */
        if (nullb) c[0][p][p] = x0[p];
        for (i=0; i<nc; i++)
            if (i!=p) c[0][p][p] -= c[0][p][i];